studio: tool calling + healing parity for Llama-3, Mistral, Gemma 4 on safetensors + MLX by danielhanchen · Pull Request #5620 · unslothai/unsloth

danielhanchen · 2026-05-19T14:28:41Z

DRAFT - holding open while we validate end to end with real models. Do not merge until safetensors + MLX behaviour is verified for every covered family.

…LX (#5615) Adds tool calling for Llama-3, Mistral (pre-v11 + v11+ + [ARGS]), and Gemma 4 to the safetensors / transformers and MLX backends. Parser patched against llama.cpp / vLLM / SGLang per-family parsers and normalises to OpenAI shape. 96 targeted unit tests + cross-OS staging CI (ubuntu / macos-14 / windows) green on the multi-format probe.

After the multi-format parser landed in #5615, the safetensors / MLX agentic loop and the GGUF loop still differed on healing behaviour. This commit closes the gaps in both directions so the two backends react the same way to identical model output. Changes: 1. core/inference/llama_cpp.py -- the GGUF BUFFERING state machine now wakes on every emission marker the shared parser knows. Was ("<tool_call>", "<function="); is now the five-tuple imported from core.inference.tool_call_parser (Qwen / Qwen3.5 / Llama-3 <|python_tag|> / Mistral [TOOL_CALLS] / Gemma 4 <|tool_call>). Stream cleanup is delegated to the same shared strip_tool_markup so leaked markup from any family is removed from assistant content. 2. core/inference/llama_cpp.py -- per-tool canonical heal key. When a tool arguments field is a bare string and JSON parsing fails, the GGUF path now heals to {"code": raw_args} for python, {"command": raw_args} for terminal, and {"query": raw_args} for everything else. Was hard-coded to {"query": raw_args}, which silently routed every python / terminal emission through web_search. Mirrors safetensors_agentic._CANONICAL_HEAL_ARG. 3. core/inference/safetensors_agentic.py -- re-prompt on plan- without-action. When the model emits a short forward-looking intent ("I'll search for that", "Let me check", "First, I will...") and no tool call, the loop nudges the model to act instead of silently returning a plan-only answer. Up to _MAX_REPROMPTS=3 (matches GGUF). The intent regex, character cap, and instruction text are byte-identical to the GGUF path. The buffer-end fall-through is unified so a buffered intent emission that never exits the BUFFERING state still triggers the re-prompt. 4. core/inference/safetensors_agentic.py -- extra iteration slots for re-prompts. The loop now budgets max_tool_iterations + _MAX_REPROMPTS + 1 total iterations and tracks the tool-call count separately, so a stalling model can be nudged 3x without eating the caller's tool-call budget. Mirrors the _extra slot reservation in the GGUF path. Tests (14 new safetensors-side units; 5 GGUF parity pins): TestLoopRePrompt -- intent-trigger, plain-answer, no-tools, cap-at-three, budget preserved, buffer-end intent. TestLoopCanonicalHealKey -- python / terminal / unknown. TestGGUFSafetensorsHealingParity -- shared markers used, shared strip used, canonical heal keys identical, intent regex matches same phrases, _MAX_REPROMPTS equal on both backends. All 110 targeted tests pass locally; the broader tool / inference / model-config / sandbox / anthropic / mlx suites stay green. Why this matters Without this parity, Llama-3.2 / Mistral / Gemma 4 emissions on Mac (MLX) and Linux-safetensors stop the agentic loop as soon as the model says "Let me...", because the GGUF re-prompt logic never existed on these backends. The two-marker GGUF BUFFERING tuple also let non-Qwen tool emissions stream out as plain prose when llama-server's structured channel did not pick them up. Both paths now drain the same way, heal the same way, and re-prompt the same way -- so a tool call that works on GGUF works identically on safetensors / MLX.

for more information, see https://pre-commit.ci

Mirrors the healing-parity follow-up commit from PR unslothai#5620: 1. GGUF BUFFERING tuple now uses the shared 5-format TOOL_XML_SIGNALS so non-Qwen tool emissions wake the state machine (was only 2 markers). 2. GGUF stream cleanup delegates to the shared strip_tool_markup so leaked markup from any family is removed from assistant content. 3. GGUF per-tool canonical heal key (python -> code, terminal -> command, * -> query) when arguments is a bare string. 4. Safetensors / MLX re-prompt on plan-without-action with _MAX_REPROMPTS=3 + extra iteration slots so re-prompts do not eat the tool budget. Also pulls in core/tool_healing.py which staging was missing (the legacy two-format helper module that llama_cpp.py imports the regex constants from).

gemini-code-assist

Code Review

This pull request significantly expands the tool-call parsing capabilities to support multiple model families, including Llama-3, Mistral, and Gemma 4, while ensuring parity between GGUF and safetensors backends. It introduces a re-prompting mechanism to nudge models that express intent without taking action and implements per-tool canonical heal keys. Feedback highlights potential issues with regex patterns for Mistral and Llama-3 that could lead to incorrect truncation or leaked markup, a bug in string unescaping that corrupts non-ASCII characters, and an order-dependency in sentinel stripping logic.

gemini-code-assist · 2026-05-19T14:36:32Z

+    re.compile(r"<\|tool_call>.*?<tool_call\|>", re.DOTALL),
+    re.compile(r"\[TOOL_CALLS\]\s*\[.*?\](?:\s*</s>)?", re.DOTALL),
+    # Mistral v11+ ``[TOOL_CALLS]name{json}`` (may chain), close at ``}``.
+    re.compile(r"\[TOOL_CALLS\]\s*[\w\.\-]+\s*(?:\[ARGS\])?\s*\{.*?\}", re.DOTALL),


The regex for Mistral v11+ tool calls uses a non-greedy .*? to match the JSON arguments, which will incorrectly truncate the block at the first closing brace }. If the model emits nested JSON (e.g., {"a": {"b": 1}}), the stripping logic will leave trailing garbage in the assistant stream. A similar issue affects the Mistral pre-v11 array pattern on line 55 if the array contains nested brackets.

gemini-code-assist · 2026-05-19T14:36:32Z

+            k = kv.group(1)
+            if kv.group(2) is not None:
+                try:
+                    args[k] = bytes(kv.group(2), "utf-8").decode("unicode_escape")


Using unicode_escape on a UTF-8 encoded byte string will corrupt literal non-ASCII characters (e.g., emojis or non-Latin scripts) emitted by the model. For example, a literal ✨ (U+2728) becomes â\x9c¨ after this transformation. Since the regex already ensures that internal quotes are escaped, a safer way to unescape the string while preserving literal UTF-8 is to use json.loads on the quoted value.

Suggested change

args[k] = bytes(kv.group(2), "utf-8").decode("unicode_escape")

args[k] = json.loads('"' + kv.group(2) + '"')

gemini-code-assist · 2026-05-19T14:36:32Z

+    for sentinel in (
+        "<|begin_of_text|>",
+        "<|eot_id|>",
+        "<|start_header_id|>",
+        "<|end_header_id|>",
+        "<|eom_id|>",
+    ):
+        stripped = stripped.lstrip()
+        if stripped.startswith(sentinel):
+            stripped = stripped[len(sentinel) :]


gemini-code-assist · 2026-05-19T14:36:32Z

+            r"<\|tool_call>.*?<tool_call\|>",
+            r"\[TOOL_CALLS\]\s*\[.*?\](?:\s*</s>)?",
+            r"\[TOOL_CALLS\]\s*[\w\.\-]+\s*(?:\[ARGS\])?\s*\{.*?\}",
+            r"<\|python_tag\|>[^\n<]*",


The regex for stripping Llama-3 tags is too restrictive. By stopping at the first newline or < character, it will leak the remainder of the tool call to the user if the call contains a < (common in queries or code) or spans multiple lines. This results in internal markup being visible in the chat UI during streaming.

Three high-priority gemini findings on the tool-call parsing additions: 1. unicode_escape on UTF-8 bytes corrupts non-ASCII literals (e.g. ✨ becomes â\x9c¨). Replace with json.loads on a quoted string -- preserves emoji / CJK / RTL while still handling \n \t \uXXXX escapes. 2. Llama-3 sentinel stripping is order-dependent. A leading `<|eot_id|><|begin_of_text|>` left `<|begin_of_text|>` behind because the loop had already passed that sentinel. Loop until no sentinel matches at the start. 3. Mistral v11+ `[TOOL_CALLS] name { json }` regex uses non-greedy `\{.*?\}` which truncates at the first `}` of a nested JSON argument, leaking the tail (e.g. `}}`) into user-visible streamed text. Same problem for the v0.3 array pattern with nested brackets. Strip those with balanced brace/bracket scanning via a new `_strip_mistral_closed_calls` helper called from `strip_tool_markup`. Also fix the inference routes' parallel `_TOOL_XML_RE`: - Same nested-JSON truncation in the Mistral patterns; route the strip through the parser's balanced-scan helper via a thin `_strip_tool_xml` wrapper that all existing callers now use. - Llama-3 `<|python_tag|>[^\n<]*` stopped at any `<`, leaking the tail of any tool call whose argument contained a literal `<` (queries, code snippets). Relax to `[^\n]*` which keeps the strip confined to the actual end-of-line.

for more information, see https://pre-commit.ci

danielhanchen · 2026-05-21T13:04:29Z

Looks safe to land. Pulled pr-5620 and verified against pr-5615.

What 5620 adds beyond 5615 (parser/inference scope):

File	LOC	What's new
`tool_call_parser.py`	+137	Balanced brace/bracket Mistral stripping, sentinel-loop, UTF-8-safe KV decoder
`safetensors_agentic.py`	+82	Intent re-prompt loop, `_MAX_REPROMPTS=3`, BUFFERING fall-through to STREAMING
`llama_cpp.py`	+39	Shared `TOOL_XML_SIGNALS` + per-tool canonical heal key (parity with safetensors)
`routes/inference.py`	+46	All 9 `_TOOL_XML_RE.sub` sites switched to `_strip_tool_xml()`
`tests/test_safetensors_tool_loop.py`	+253	`TestLoopRePrompt` (6) + `TestLoopCanonicalHealKey` (3)

Each of the five #5615 regressions has a fix + test in 5620 — details in #5615/#5619.

Tests: pytest studio/backend/tests/test_safetensors_tool_loop.py studio/backend/tests/test_safetensors_capability_advertise.py -q → 110 passed in 3.86s (had to pip install structlog — missing dev dep, not a code issue).

Worth watching when the validation phase finishes:

False-positive tool detection on prose JSON. _parse_llama3_bare_json (tool_call_parser.py:486-547) triggers on any leading { with name + parameters|arguments. The strict parameters/arguments dict guard mitigates this, but a user echoing back {"name":"foo","arguments":{}} in a code fence isn't stripped first — worth a quick fence-aware preamble check.
python_tag multi-line code. The full parser (_parse_llama3_python_tag) handles multi-line via _balanced_brace_end, but the routes-level _strip_tool_xml() helper is single-line ([^\n]*). Multi-line python.call(code="…\n…") will have only the first line stripped at the route layer; remaining lines may leak in non-parser paths.
Partial Mistral buffer. _strip_mistral_closed_calls correctly leaves unclosed [TOOL_CALLS]… in place, but the BUFFERING state machine caps at _MAX_BUFFER_CHARS=32 (safetensors_agentic.py:44). A long [TOOL_CALLS] very_long_name [ARGS] {… that exceeds 32 chars before reaching { may flush to user. Worth a test that names a tool with a long identifier and asserts no early flush.
Re-prompt bound is _MAX_REPROMPTS=3 and _tool_iters_done = iteration + 1 - reprompt_count (safetensors_agentic.py:447), with test_max_reprompts_capped_at_three (test_safetensors_tool_loop.py:893) covering the cap. Looks correct.

Holding the [DRAFT] flag while the end-to-end real-model validation finishes is the right call. Once that's green, this is ready to merge.

Earlier revisions of _TOOL_XML_RE in studio.backend.routes.inference oscillated between two bug shapes: 5615 r"<\|python_tag\|>[^\n<]*" -- stopped at any literal "<" so code='if x < 10: pass' leaked '< 10: pass)' to the user. 5620.1 r"<\|python_tag\|>[^\n]*" -- single-line only; the second line of python.call(code="a\nb") leaked. The full parser (_parse_llama3_python_tag) already handles both via balanced-brace scanning, so the parsing path was fine; the LEAK was in the streaming strip path that runs on every cumulative emission while content is still arriving. Switch to r"<\|python_tag\|>(?:[^<]|<(?!\|))*" so the strip consumes: * any character that is not a "<" (newlines, JSON, code, ...), * a "<" only when it is NOT followed by "|" (i.e. NOT a Llama-3 sentinel start like <|eot_id|>, <|eom_id|>, <|begin_of_text|>). This means: * code='if x < 10' stays inside the strip (5615 fix preserved), * multi-line code stays inside the strip (5620 round 2), * the strip terminates at the next Llama-3 sentinel so trailing assistant content survives. Tests: TestRoutesPythonTagStrip (8 cases) pytest test_safetensors_tool_loop.py test_safetensors_capability_advertise.py -> 118 passed in 1.81s (was 110).

for more information, see https://pre-commit.ci

danielhanchen · 2026-05-22T08:19:33Z

Pushed 7ef4b115 closing the routes-level multi-line python_tag gap flagged earlier. The full parser already handled this via balanced-brace scanning; the leak was only on the streaming strip path that runs while content is still arriving.

Earlier history of _TOOL_XML_RE's python_tag clause:

Revision	Pattern	Bug
#5615	`<\|python_tag\|>[^\n<]*`	Stopped at literal `<` so `code="if x < 10"` leaked `< 10: pass)`
#5620 round 1	`<\|python_tag\|>[^\n]*`	Single-line only — the second line of `python.call(code="a\nb")` leaked
this commit	`<\|python_tag\|>(?:[^<]\|<(?!\|))*`	Consumes everything that is NOT a Llama-3 `<\|sentinel\|>` start. Multi-line code, embedded JSON, and literal `<` in code all stay inside the strip.

pytest studio/backend/tests/test_safetensors_tool_loop.py studio/backend/tests/test_safetensors_capability_advertise.py -q → 118 passed in 1.81s (was 110, +8 new tests in TestRoutesPythonTagStrip covering single-line, less-than-in-code, multi-line, multi-line-with-less-than, sentinel-stop, and JSON-form multi-line cases).

The _MAX_BUFFER_CHARS=32 concern I raised earlier turned out to be unfounded — the BUFFERING state machine treats a fully arrived [TOOL_CALLS] as a match (length 12, well under 32) and transitions to DRAINING, so a long-tool-name [TOOL_CALLS] very_long_name [ARGS] {… never accumulates 32 chars in BUFFERING. The 32-char cap only applies to partial signal prefixes (e.g. <|py waiting to complete to <|python_tag|>), and the longest signal is 14 chars (<|python_tag|>), so the cap is already more than 2× the worst-case prefix length.

The remaining false-positive concern (a model echoing back {"name":"foo","arguments":{}} as the very first non-whitespace characters of an assistant response triggering _parse_llama3_bare_json) is bounded by the strict parameters/arguments dict guard, which keeps plain prose unaffected — happy to add a fence-aware preamble check in a follow-up if you want belt-and-suspenders on that path too.

Each PR ran the same staged source files before, which went stale when the upstream PR commits advanced. Refactor to one job per PR with an actions/checkout of that PR's head ref, so cross-OS validation always uses the latest commit: - PR unslothai#5603 sandbox -> studio-sandbox-hardening - PR unslothai#5620 parser parity -> studio-tools-multi-format-v2 - PR unslothai#5696 mtp reload guards -> followup-mtp-reload-guards (unslothai#5582 followup) - PR unslothai#5695 lockfile audit -> followup-lockfile-audit-regressions (unslothai#5604 followup) 4 jobs x 3 OSes = 12 runs; Windows = 4 (below the 5-concurrent cap). cancel-in-progress per (workflow, ref) keeps iteration cheap. All tests stay CPU-only and rely on the CUDA spoof harness in tests/conftest.py + tests/_zoo_aggressive_cuda_spoof.py, so no real GPU is required on any runner.

unslothai#5620 parser tests transitively import the safetensors loop, which needs the datasets package. unslothai#5696 route-guard tests import routes/inference.py, which transitively imports core/training (uses matplotlib). Add both, plus the auth deps (pyjwt/cryptography/ aiosqlite/python-multipart) needed for any test that touches the FastAPI route module so route-level imports resolve cleanly on all three OSes.

Conflict in studio/backend/routes/inference.py: combine #5620's multi-format strip (Llama-3 <|python_tag|>, Gemma 4 <|tool_call>... <tool_call|>, Mistral [TOOL_CALLS] via parser helper) with main's speculative-buffer leak fixes (orphan open to EOF, bare orphan close, tail-only </parameter> anchored to \Z so mid-text <parameter> in user code samples survives). pytest studio/backend/tests/test_safetensors_tool_loop.py \ studio/backend/tests/test_safetensors_capability_advertise.py -q -> 118 passed.

for more information, see https://pre-commit.ci

Comments were narrating what the code already says. Cut historical "earlier revisions used X, then Y" narratives down to one-line WHY notes where the footgun still matters (canonical heal-key parity, balanced-brace vs non-greedy regex, ``(?:[^<]|<(?!\|))*`` over ``[^\n<]*``/``[^\n]*``). Drop section-header banners. No behaviour change. Re-ran: pytest studio/backend/tests/test_safetensors_tool_loop.py \ studio/backend/tests/test_safetensors_capability_advertise.py -q -> 118 passed. Regression replay (parser + _coerce_arguments on the 5 #5615 inputs) -> 21/21.

for more information, see https://pre-commit.ci

Three surgical extensions to the multi-format tool-call parser, each covering a real fine-tune / template emission shape that the current parser silently drops. No path narrows; all changes widen what is accepted. 1. `_parse_tool_call_json` now accepts both `arguments` and `parameters` keys. A Hermes / Qwen `<tool_call>{json}</tool_call>` wrapper around a Llama-3.2 fine-tune that emits the `parameters` key was extracting the tool name and silently discarding the args, producing a working-shaped call with an empty payload. The bare-JSON and python_tag paths already accepted both keys; this path now matches them. 2. `_TC_FUNC_START_RE`, `_TC_PARAM_START_RE`, and `_TC_PARAM_CLOSE_RE` now also match the attribute form `<function name="..."><param name="...">v</param></function>` used by MiniCPM-5 and MiniMax-M2. Names land in either capture group, and `</param>` is accepted as a short close. 3. `_parse_llama3_bare_json` sentinel-strip now consumes the role label inserted between `<|start_header_id|>` and `<|end_header_id|>` by Meta's official Llama-3.x chat template. Without this, every assistant turn re-fed through the template prefix `<|start_header_id|>assistant<|end_header_id|>\n\n{json}` parsed to zero calls, so any history-with-tool-call round-trip in production silently dropped. Tests in `studio/backend/tests/test_safetensors_tool_loop.py`: * `TestParserRobustness::test_tool_call_json_accepts_parameters_key` * `TestParserRobustness::test_function_xml_attribute_form` * `TestParserRobustness::test_function_xml_attribute_form_multi_param` * `TestParserRobustness::test_function_xml_legacy_equals_form_still_works` (regression guard for the existing `<function=name>` syntax) * `TestParserRobustness::test_llama3_chat_template_round_trip` * `TestParserRobustness::test_llama3_round_trip_all_roles` * `TestParserRobustness::test_llama3_round_trip_with_eot_prefix` `pytest studio/backend/tests/test_safetensors_tool_loop.py studio/backend/tests/test_safetensors_capability_advertise.py -q` goes from 118 to 125 passed.

for more information, see https://pre-commit.ci

Two conflicts, both in the tool-call markup-strip plumbing: * `studio/backend/core/inference/tool_call_parser.py` -- main widened `<function=NAME>` and `<parameter=NAME>` from `\w+` to `[\w-]+` so MCP tool ids with hyphens (mcp__srv__list-issues, `repo-name`, `issue-number`) parse the same as the built-ins. PR-5620 had broadened the same patterns to `[\w\.\-]+` (also handles dotted module names) and added the python_tag / Mistral / Gemma 4 strip families plus the `TOOL_XML_SIGNALS` constant. Resolved by keeping PR-5620's full pattern set; `[\w\.\-]+` already accepts the hyphens main was widening to, so no MCP coverage is lost. * `studio/backend/routes/inference.py` -- `_TOOL_XML_RE` regex on the route-level scrub path. Same hyphen widening on main vs. PR-5620's union of (closed pair / orphan open / Gemma 4 / python_tag / `</parameter>` tail-only). Resolved by unioning all five patterns and applying main's `[\w-]+` on the `function=` name char class so hyphenated MCP names get stripped on the route too. pytest studio/backend/tests/test_safetensors_tool_loop.py studio/backend/tests/test_safetensors_capability_advertise.py -q -> 125 passed.

Four fixes addressing review of the parent commit: 1. GLM <arg_value> coercion: tighten the json.loads -> ast.literal_eval -> raw cascade to only deserialize when the body unambiguously looks like a JSON literal (object, array, JSON-encoded string, true/false/null, or numeric). Strings like ``True`` / ``None`` (Python literals, not JSON) and arbitrary prose now stay raw. The bare-numeric / bare-boolean ambiguity with string args remains an inherent limitation of the template without schema access -- documented in the new comment. Drops the ast import entirely (closes Gemini's :1036 suggestion). 2. Kimi K2 bare-counter ids (e.g. ``<|tool_call_begin|>3``) are now dropped rather than surfaced as a tool literally named "3". Matches vLLM behaviour; SGLang's schema-infer fallback is out of scope at the parse site. Real Kimi K2 emissions use ``functions.NAME:IDX`` so this is the exception path. 3. Restore the elaborate ``<|python_tag|>(?:[^<]|<(?!\|))*`` clause in routes.inference._TOOL_XML_RE -- the simpler ``[^\n<]*`` form regressed PR unslothai#5620's multi-line / literal-``<`` python_tag fix. Restore ``TestRoutesPythonTagStrip`` (8 tests) adapted to call ``_TOOL_XML_RE.sub`` directly since the ``_strip_tool_xml`` helper was inlined this PR. 4. Add the spaced and backslash-escaped DeepSeek opener variants (``<｜tool calls begin｜>``, ``<｜tool\_calls\_begin｜>``) to ``TOOL_XML_SIGNALS`` for streaming-gate parity with ``_DEEPSEEK_BEGIN_RE``. Also updates the llama.cpp / vLLM citations in the parser docstrings: ``common/chat-parser.cpp`` was split into ``common/chat.cpp`` + ``common/chat-peg-parser.cpp`` by llama.cpp PR #18675, and vLLM moved the tool parsers from ``vllm/entrypoints/openai/tool_parsers/`` to ``vllm/tool_parsers/``. Pin to pre-refactor commit ``51fa458a92d6`` where the cited line numbers still resolve. New regression tests in ``test_pr5624_regressions.py`` cover the GLM coercion heuristic shapes, GLM literal-``<`` in arg_value, Kimi K2 dotted name, Kimi K2 bare-counter drop, DeepSeek V3.1 truncated mid-stream, and routes-layer strip across all three new families. Tests: pytest studio/backend/tests/test_safetensors_tool_loop.py studio/backend/tests/test_safetensors_capability_advertise.py studio/backend/tests/test_pr5624_regressions.py -q -> 170 passed in 1.91s

…all> `_parse_function_xml` was looking for `</tool_call>` (the Hermes wrapper) as the body terminator. When a model emits a standalone `<function=NAME><parameter=K>v</parameter></function>` followed by explanatory prose (which models routinely do), no `</tool_call>` is present, so the body extended to end-of-string and the trailing prose leaked into the LAST parameter value. Pre-existing on main (the legacy `<function=NAME>` form had this bug too). Same affects PR #5620's new attribute-form `<function name="NAME"><param name="K">v</param></function>` emission used by MiniCPM-5 / MiniMax-M2. Fix: `_TC_END_TAG_RE` now matches either `</tool_call>` OR `</function>`. The existing `_TC_FUNC_CLOSE_RE` / `_TC_PARAM_CLOSE_RE` strips are unchanged. Multi-call inputs still bound each function at the next `<function=` start, so no over-eager consumption. New tests: * `test_function_xml_followed_by_prose` (legacy form + prose) * `test_function_attribute_xml_followed_by_prose` (attribute form + prose) Existing `test_code_with_embedded_xml` still passes (a parameter value containing literal `<a></a>` is preserved because the embedded close tag is `</a>`, not `</function>`). `pytest studio/backend/tests/test_safetensors_tool_loop.py studio/backend/tests/test_safetensors_capability_advertise.py -q` goes from 125 to 127 passed.

A fuzz pass on PR #5811 turned up that ``_parse_llama3_bare_json`` accepted ``parameters`` as a string, contradicting the docstring's "parameters or arguments is a dict" guard. Prose JSON like ``{"name":"foo","parameters":"a sentence"}`` would wrongly fire the parser, which the agentic loop would then heal into a real ``foo(query="a sentence")`` call. Same code lives on this branch, so the same fix applies here. Tightened guard: - ``parameters`` must be a dict (Llama-3 spec). - ``arguments`` may be a dict, or a JSON-encoded string that decodes to a dict (OpenAI shape, e.g. ``"arguments":"{\"q\":\"x\"}"``). Plain non-JSON strings or JSON-strings of lists / scalars / null no longer pass. Mirrors the fix landed in PR #5811 commit 615b860. Adds the same 4 regression tests under TestParserMultiFormat. Existing test suite stays green: 127 -> 131 passing.

The JSON sub-path of ``_parse_llama3_python_tag`` was fabricating ``{"value": args}`` when the model emitted a non-dict / non-string ``arguments`` value (e.g. ``42``, ``[1,2,3]``, ``null``, ``true``). This silently turned a malformed emission into a real tool call, which the agentic loop would then execute with arguments the model never intended. Tightened: skip the call instead of fabricating. The same behaviour now matches the bare-JSON guard tightened earlier (strict-guard merge from PR unslothai#5620, inherited via merge here). Added a regression test covering the four non-scalar shapes. Pass count on this branch: 158 -> 159. Sites in ``_parse_tool_call_json`` and ``_consume_mistral_call`` keep the existing looser behaviour for now; both are reached only after explicit ``<tool_call>`` / ``[TOOL_CALLS]`` markers so the false-positive surface there is much narrower.

…ALL_ID / THINK, attribute-form signal) Three GGUF-parity fixes to the safetensors tool-call parser, each matching llama.cpp's reference behaviour: - Mistral Small 3.2 emits [TOOL_CALLS]name[CALL_ID]<id>[ARGS]{json}. The parser stopped after the name on seeing [CALL_ID] (neither [ARGS] nor {), dropping the call. Skip an optional [CALL_ID]<id> segment in both the parse and strip paths. llama.cpp parses this (test-chat.cpp:4785). - Magistral wraps reasoning in [THINK]...[/THINK]. A [TOOL_CALLS] inside the reasoning was parsed as a real call, producing a phantom call. Strip a leading [THINK] block before scanning so only the post-reasoning call counts (test-chat.cpp:2285); a literal [THINK] inside a later argument is left intact. - The standalone MiniCPM-5 / MiniMax-M2 <function name="..."> attribute form parsed correctly but was absent from TOOL_XML_SIGNALS and the markup strip patterns, so the streaming safety-net parse was gated off (dropping the call) and markup leaked into displayed text. Add the signal and broaden the strip regexes. Adds regression tests for all three.

for more information, see https://pre-commit.ci

The agentic loop's streaming safety-net parse was gated on has_tool_signal(), which is False for the Llama-3.1 / 3.2 bare-JSON tool form {"name":..,"parameters":..} (no XML marker). Real tool calls were therefore dropped: the loop logged "model planned without calling tools", re-prompted three times, then gave up with zero tool calls, while GGUF's llama-server parses the same emission natively. Run parse_tool_calls_from_text() unconditionally in the safety net. The parser is strict (only fires on a valid tool-call shape) so plain answers are unaffected. Reproduced on a real unsloth/Llama-3.1-8B-Instruct run: the model emits {"name":"web_search","parameters":{...}} which now executes the tool instead of being re-prompted into a no-op. Adds a loop regression test for the bare-JSON form.

…to studio-tools-multi-format-v2

danielhanchen · 2026-05-31T15:47:36Z

Confirmed the Llama-3 side of this on a B200 against the matching unsloth UD-Q4_K_XL GGUF. Llama-3.1-8B safetensors fires tools on 18/18 runs (6 prompts, 3 seeds, same generation params as the GGUF), matching the GGUF 18/18 and passing 5/6 of the deterministic code checks vs 2/6 for the GGUF. The bare-JSON form {"name": ..., "parameters": ...} is the key case, since the skip-special-token stream drops <|python_tag|> and only the multi-format parser recovers it.

Mistral-Small-3.2 refuses to call tools in both backends for these prompts, so that one is model behaviour rather than the parser. Full per-family numbers and the cross-checks against llama.cpp, vLLM, and SGLang are on the stacked PR #5624.

danielhanchen and others added 3 commits May 19, 2026 14:27

[pre-commit.ci] auto fixes from pre-commit.com hooks

f790680

for more information, see https://pre-commit.ci

gemini-code-assist Bot reviewed May 19, 2026

View reviewed changes

shimmyshimmer and others added 2 commits May 19, 2026 14:56

[pre-commit.ci] auto fixes from pre-commit.com hooks

7d8e725

for more information, see https://pre-commit.ci

Daniel Han and others added 2 commits May 22, 2026 08:18

[pre-commit.ci] auto fixes from pre-commit.com hooks

0b4fa12

for more information, see https://pre-commit.ci

danielhanchen and others added 4 commits May 27, 2026 12:19

[pre-commit.ci] auto fixes from pre-commit.com hooks

7f9177a

for more information, see https://pre-commit.ci

[pre-commit.ci] auto fixes from pre-commit.com hooks

9b3a6c6

for more information, see https://pre-commit.ci

danielhanchen mentioned this pull request May 27, 2026

Studio: re-introduce multi-format tool calling with parser bug fixes #5811

Draft

danielhanchen and others added 3 commits May 27, 2026 13:31

[pre-commit.ci] auto fixes from pre-commit.com hooks

1919e2a

for more information, see https://pre-commit.ci

danielhanchen and others added 2 commits May 27, 2026 14:23

danielhanchen and others added 2 commits May 31, 2026 13:17

[pre-commit.ci] auto fixes from pre-commit.com hooks

5a02120

for more information, see https://pre-commit.ci

danielhanchen added 2 commits May 31, 2026 13:51

Merge remote-tracking branch 'origin/studio-tools-multi-format-v2' in…

85751a1

…to studio-tools-multi-format-v2

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

studio: tool calling + healing parity for Llama-3, Mistral, Gemma 4 on safetensors + MLX#5620

studio: tool calling + healing parity for Llama-3, Mistral, Gemma 4 on safetensors + MLX#5620
danielhanchen wants to merge 20 commits into
mainfrom
studio-tools-multi-format-v2

danielhanchen commented May 19, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot May 19, 2026

Uh oh!

gemini-code-assist Bot May 19, 2026

Uh oh!

gemini-code-assist Bot May 19, 2026

Uh oh!

gemini-code-assist Bot May 19, 2026

Uh oh!

danielhanchen commented May 21, 2026

Uh oh!

danielhanchen commented May 22, 2026

Uh oh!

danielhanchen commented May 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	args[k] = bytes(kv.group(2), "utf-8").decode("unicode_escape")
	args[k] = json.loads('"' + kv.group(2) + '"')

Uh oh!

Conversation

danielhanchen commented May 19, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

danielhanchen commented May 21, 2026

Uh oh!

danielhanchen commented May 22, 2026

Uh oh!

danielhanchen commented May 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants