Skip to content

studio: tool calling for Llama-3, Mistral, Gemma 4 on safetensors + MLX#5615

Merged
danielhanchen merged 2 commits into
mainfrom
studio-tools-multi-format
May 19, 2026
Merged

studio: tool calling for Llama-3, Mistral, Gemma 4 on safetensors + MLX#5615
danielhanchen merged 2 commits into
mainfrom
studio-tools-multi-format

Conversation

@danielhanchen

Copy link
Copy Markdown
Member

See PR description below — full summary written inline.

The shared tool_call_parser used by safetensors and MLX now recognises
the canonical emission shapes for the most popular families so the
agentic loop sees the same call shape llama-server normalises for
GGUF.  Patched against llama.cpp's per-family parsers (common/chat-
parser.cpp, legacy pre-PEG branch at 34df42f7be), vLLM's
tool_parsers/, and SGLang's function_call/ modules.

Formats covered:

  Qwen / Hermes      <tool_call>{json}</tool_call>
  Qwen3.5 / Hermes   <function=name><parameter=k>v</parameter></function>
  Llama-3 built-in   <|python_tag|>NAME.call(k="v", ...)
  Llama-3 custom     <|python_tag|>{"name":..., "parameters":...}
  Llama-3.2 bare     {"name":..., "parameters":...}    (no marker)
  Mistral pre-v11    [TOOL_CALLS] [{"name":..., "arguments":...}, ...]
  Mistral v11+       [TOOL_CALLS]name{json}            (may chain)
  Ministral / Large 3 [TOOL_CALLS]name[ARGS]{json}
  Gemma 4            <|tool_call>call:NAME{k:<|"|>v<|"|>}<tool_call|>

All parsers normalise to OpenAI shape
``{id, type:"function", function:{name, arguments(json_string)}}``.
Truncated emissions (unclosed brackets, missing close tags) are
tolerated -- balanced-brace walkers fall back to per-object healing
so a mid-stream cut does not lose the call.

Llama-3.2 bare-JSON parser is strict: it only fires when stripped
content starts with ``{`` and the parsed object has ``name`` (str)
plus a dict in ``parameters`` or ``arguments``.  Plain assistant
prose, tool-message echoes, and JSON missing those keys all leave
it dormant.

routes/inference._detect_safetensors_features now allows templates
whose tool-call format is any of the seven supported markers; the
gate still suppresses ``supports_tools`` for templates that
advertise tools but use a shape the parser cannot honour, so the UI
never enables a pill the loop will not return.

Streaming buffer wakes up on five markers (was two) so the
safetensors / MLX state machine drains tool calls instead of leaking
them as prose:

  TOOL_XML_SIGNALS = (
      "<tool_call>", "<function=",
      "<|python_tag|>", "[TOOL_CALLS]", "<|tool_call>",
  )

The route-layer markup-strip regex ``_TOOL_XML_RE`` is extended to
match every closed-pair shape, including Mistral v11+ ``name{json}``
and Llama-3 ``<|python_tag|>...\n`` so leaked markup is removed from
SSE / non-streaming completions across all five families.

Tests: 37 new unit tests covering each emission shape (parser +
streaming buffer + strip_tool_markup + agentic loop), 11 bare-JSON
edge cases guarding against false positives, and 4 new capability
advertise tests pinning the gate to recognise Llama-3 / Mistral /
Gemma 4 / Llama-3.2 bare-JSON templates as supports_tools=True
while still suppressing tools for unknown emission formats.

The previous suppression tests (Llama-3 template suppresses tools,
Mistral template suppresses tools) are inverted to assert the new
gate keeps tools enabled for those families -- the loop now
supports them end to end.

Cross-OS validation (ubuntu / macos-14 / windows) lives on the
staging fork: danielhanchen/unsloth-staging-2 #126, which exercises
the multi-format parser against 9 representative fixtures plus the
existing macos-14 MLX Qwen3.5-0.8B cartesian probe.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c5c707fbe7

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +36 to +42
TOOL_XML_SIGNALS = (
"<tool_call>",
"<function=",
"<|python_tag|>",
"[TOOL_CALLS]",
"<|tool_call>",
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Detect bare Llama JSON before streaming it

When a Llama-3.2 custom-tools template emits the newly supported bare {"name":..., "parameters":...} shape, there is no marker from this tuple in the output. In the safetensors loop, BUFFERING therefore treats the first { as ordinary text and switches to STREAMING, and the end-of-stream safety path only calls parse_tool_calls_from_text when has_tool_signal(content_accum) is true, so the bare JSON parser added below is never reached; the raw tool call is shown as the assistant answer and no tool executes. Please add a streaming/final detection path for the bare JSON shape or run the final parser even without one of these signals.

Useful? React with 👍 / 👎.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request significantly expands the tool-call parsing capabilities to support a wide range of model emission formats, including Llama-3, Llama-3.2, Mistral (pre-v11 and v11+), and Gemma 4. It updates the inference routes to correctly detect these capabilities in chat templates and provides regex-based stripping for the new markers. Feedback focuses on improving the robustness of regex patterns for Mistral and Llama-3 to handle nested structures and multi-line content, as well as correcting a potential character corruption issue when decoding escape sequences in Llama-3 tool calls.

Comment on lines +55 to +57
re.compile(r"\[TOOL_CALLS\]\s*\[.*?\](?:\s*</s>)?", re.DOTALL),
# Mistral v11+ ``[TOOL_CALLS]name{json}`` (may chain), close at ``}``.
re.compile(r"\[TOOL_CALLS\]\s*[\w\.\-]+\s*(?:\[ARGS\])?\s*\{.*?\}", re.DOTALL),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The regex patterns for Mistral tool calls in _TOOL_CLOSED_PATS use non-greedy matching (.*?) ending at the first ] or }. This will cause incorrect partial matches and stripping if the tool arguments contain nested arrays or objects (e.g., [TOOL_CALLS]name{"a": {"b": 1}} would match up to the first }). This leads to leaked markup and corrupted text in the UI during streaming and final cleanup.

Comment on lines +451 to +452
r"\[TOOL_CALLS\]\s*\[.*?\](?:\s*</s>)?",
r"\[TOOL_CALLS\]\s*[\w\.\-]+\s*(?:\[ARGS\])?\s*\{.*?\}",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

These Mistral stripping patterns suffer from the same issue as those in tool_call_parser.py: the non-greedy .*? will stop at the first nested bracket or brace within the JSON arguments, causing partial stripping and leaking the remainder of the tool call to the user.

args: dict[str, Any] = {}
for kv in _LLAMA3_KV_RE.finditer(body):
k = kv.group(1)
if kv.group(2) is not None:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Decoding escape sequences by encoding to UTF-8 and then using unicode_escape will corrupt non-ASCII characters. For example, a string like "München\n" will be incorrectly decoded as "München\n". Since the regex already ensures the content is a valid quoted string, json.loads is a safer and more robust way to decode standard escape sequences while preserving Unicode characters.

Suggested change
if kv.group(2) is not None:
args[k] = json.loads('"' + kv.group(2) + '"')

r"<\|tool_call>.*?<tool_call\|>",
r"\[TOOL_CALLS\]\s*\[.*?\](?:\s*</s>)?",
r"\[TOOL_CALLS\]\s*[\w\.\-]+\s*(?:\[ARGS\])?\s*\{.*?\}",
r"<\|python_tag\|>[^\n<]*",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The Llama-3 stripping regex r"<\|python_tag\|>[^\n<]*" is too restrictive and will fail to strip multi-line JSON tool calls (common in Llama-3.1/3.2 custom tools). It stops at the first newline, leaving the rest of the JSON visible to the user. A more robust approach for this tag (which lacks a formal closing tag) is to match until the next potential tag start or the end of the string.

Suggested change
r"<\|python_tag\|>[^\n<]*",
r"<\|python_tag\|>.*?(?=<|$)",

@danielhanchen danielhanchen merged commit af35ed8 into main May 19, 2026
1 of 31 checks passed
@danielhanchen danielhanchen deleted the studio-tools-multi-format branch May 19, 2026 14:14
danielhanchen added a commit that referenced this pull request May 19, 2026
…sors + MLX (#5615)" (#5619)

Reverts PR #5615 to give the safetensors + MLX healing parity work more time to bake before re-merging. The reverted feature branch `studio-tools-multi-format` remains untouched, and the follow-up PR will layer the healing-parity commits on top.
@danielhanchen

Copy link
Copy Markdown
Member Author

Re-reviewed after the fact (this PR was already reverted in #5619, with the replacement under #5620). Confirming the revert was the right call — pr-5615 shipped four concrete parser bugs that are end-user-visible:

  1. Mistral nested-JSON truncation. \[TOOL_CALLS\]\s*[\w\.\-]+\s*(?:\[ARGS\])?\s*\{.*?\} (tool_call_parser.py:55-56) is non-greedy on }. For [TOOL_CALLS]search{"filters":{"date":"2024"},"q":"foo"} it stripped only up to the inner }, leaking ,"q":"foo"} to the user. Fixed in studio: tool calling + healing parity for Llama-3, Mistral, Gemma 4 on safetensors + MLX #5620 via _strip_mistral_closed_calls + _balanced_brace_end.
  2. <|python_tag|> stop-on-<. <\|python_tag\|>[^\n<]* (routes/inference.py:452) stops at any <. A Llama-3 call like <|python_tag|>python.call(code="if x < 10: pass") got stripped to < 10: pass"). Fixed in studio: tool calling + healing parity for Llama-3, Mistral, Gemma 4 on safetensors + MLX #5620 to [^\n]*.
  3. Llama-3 sentinel single-pass. The fixed-order for sentinel in (...) loop (tool_call_parser.py:402-415) silently dropped calls when the stream contained <|eot_id|><|begin_of_text|>{json} because begin_of_text was tested before eot_id consumed its prefix. Fixed by the while True / matched loop in studio: tool calling + healing parity for Llama-3, Mistral, Gemma 4 on safetensors + MLX #5620.
  4. UTF-8 corruption in Llama-3 KV decoder. bytes(s, "utf-8").decode("unicode_escape") (tool_call_parser.py:419-422) mangles non-ASCII bytes — "café日本"'caféæ\x97¥æ\x9c¬'. Any tool call with emoji/CJK silently corrupted arguments. Fixed via json.loads('"' + value + '"') in studio: tool calling + healing parity for Llama-3, Mistral, Gemma 4 on safetensors + MLX #5620.

Plus a fifth that motivated the parity package in #5620:
5. GGUF heal key always query. llama_cpp.py:4852 healed bare-string args on python/terminal calls to {"query": …} instead of {"code": …} / {"command": …}, so the python sandbox / terminal never executed the content. Safetensors had _CANONICAL_HEAL_ARG here in #5615; GGUF gets parity in #5620 at llama_cpp.py:4861-4870.

For the record: pytest studio/backend/tests/test_safetensors_tool_loop.py studio/backend/tests/test_safetensors_capability_advertise.py -q is green on both pr-5615 (96 passed) and pr-5620 (110 passed) — pr-5615's own suite simply did not cover any of the five regressions above, which is why they shipped. #5620 adds 14 dedicated tests (TestLoopRePrompt, TestLoopCanonicalHealKey) that lock the fixes down.

No cleanup needed on this PR — #5619 reverted the merge cleanly.

danielhanchen added a commit that referenced this pull request May 27, 2026
Comments were narrating what the code already says. Cut historical
"earlier revisions used X, then Y" narratives down to one-line WHY
notes where the footgun still matters (canonical heal-key parity,
balanced-brace vs non-greedy regex, ``(?:[^<]|<(?!\|))*`` over
``[^\n<]*``/``[^\n]*``). Drop section-header banners.

No behaviour change. Re-ran:
  pytest studio/backend/tests/test_safetensors_tool_loop.py \
         studio/backend/tests/test_safetensors_capability_advertise.py -q
  -> 118 passed.
Regression replay (parser + _coerce_arguments on the 5 #5615 inputs)
  -> 21/21.
rsd-darshan pushed a commit to rsd-darshan/unsloth that referenced this pull request Jun 3, 2026
…LX (unslothai#5615)

Adds tool calling for Llama-3, Mistral (pre-v11 + v11+ + [ARGS]), and Gemma 4 to the safetensors / transformers and MLX backends. Parser patched against llama.cpp / vLLM / SGLang per-family parsers and normalises to OpenAI shape. 96 targeted unit tests + cross-OS staging CI (ubuntu / macos-14 / windows) green on the multi-format probe.
rsd-darshan pushed a commit to rsd-darshan/unsloth that referenced this pull request Jun 3, 2026
…sors + MLX (unslothai#5615)" (unslothai#5619)

Reverts PR unslothai#5615 to give the safetensors + MLX healing parity work more time to bake before re-merging. The reverted feature branch `studio-tools-multi-format` remains untouched, and the follow-up PR will layer the healing-parity commits on top.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant