Skip to content

studio: tool calling + healing parity for Llama-3, Mistral, Gemma 4 on safetensors + MLX#5620

Draft
danielhanchen wants to merge 20 commits into
mainfrom
studio-tools-multi-format-v2
Draft

studio: tool calling + healing parity for Llama-3, Mistral, Gemma 4 on safetensors + MLX#5620
danielhanchen wants to merge 20 commits into
mainfrom
studio-tools-multi-format-v2

Conversation

@danielhanchen

Copy link
Copy Markdown
Member

DRAFT - holding open while we validate end to end with real models. Do not merge until safetensors + MLX behaviour is verified for every covered family.

danielhanchen and others added 3 commits May 19, 2026 14:27
…LX (#5615)

Adds tool calling for Llama-3, Mistral (pre-v11 + v11+ + [ARGS]), and Gemma 4 to the safetensors / transformers and MLX backends. Parser patched against llama.cpp / vLLM / SGLang per-family parsers and normalises to OpenAI shape. 96 targeted unit tests + cross-OS staging CI (ubuntu / macos-14 / windows) green on the multi-format probe.
After the multi-format parser landed in #5615, the safetensors / MLX
agentic loop and the GGUF loop still differed on healing behaviour.
This commit closes the gaps in both directions so the two backends
react the same way to identical model output.

Changes:

1. core/inference/llama_cpp.py -- the GGUF BUFFERING state machine
   now wakes on every emission marker the shared parser knows. Was
   ("<tool_call>", "<function="); is now the five-tuple imported
   from core.inference.tool_call_parser (Qwen / Qwen3.5 / Llama-3
   <|python_tag|> / Mistral [TOOL_CALLS] / Gemma 4 <|tool_call>).
   Stream cleanup is delegated to the same shared strip_tool_markup
   so leaked markup from any family is removed from assistant
   content.

2. core/inference/llama_cpp.py -- per-tool canonical heal key. When
   a tool arguments field is a bare string and JSON parsing fails,
   the GGUF path now heals to {"code": raw_args} for python,
   {"command": raw_args} for terminal, and {"query": raw_args} for
   everything else. Was hard-coded to {"query": raw_args}, which
   silently routed every python / terminal emission through
   web_search. Mirrors safetensors_agentic._CANONICAL_HEAL_ARG.

3. core/inference/safetensors_agentic.py -- re-prompt on plan-
   without-action. When the model emits a short forward-looking
   intent ("I'll search for that", "Let me check", "First, I
   will...") and no tool call, the loop nudges the model to act
   instead of silently returning a plan-only answer. Up to
   _MAX_REPROMPTS=3 (matches GGUF). The intent regex, character
   cap, and instruction text are byte-identical to the GGUF path.
   The buffer-end fall-through is unified so a buffered intent
   emission that never exits the BUFFERING state still triggers
   the re-prompt.

4. core/inference/safetensors_agentic.py -- extra iteration slots
   for re-prompts. The loop now budgets max_tool_iterations +
   _MAX_REPROMPTS + 1 total iterations and tracks the tool-call
   count separately, so a stalling model can be nudged 3x without
   eating the caller's tool-call budget. Mirrors the _extra slot
   reservation in the GGUF path.

Tests (14 new safetensors-side units; 5 GGUF parity pins):

  TestLoopRePrompt                 -- intent-trigger, plain-answer,
                                      no-tools, cap-at-three, budget
                                      preserved, buffer-end intent.
  TestLoopCanonicalHealKey         -- python / terminal / unknown.
  TestGGUFSafetensorsHealingParity -- shared markers used, shared
                                      strip used, canonical heal keys
                                      identical, intent regex matches
                                      same phrases, _MAX_REPROMPTS
                                      equal on both backends.

All 110 targeted tests pass locally; the broader tool / inference /
model-config / sandbox / anthropic / mlx suites stay green.

Why this matters

Without this parity, Llama-3.2 / Mistral / Gemma 4 emissions on Mac
(MLX) and Linux-safetensors stop the agentic loop as soon as the
model says "Let me...", because the GGUF re-prompt logic never
existed on these backends. The two-marker GGUF BUFFERING tuple also
let non-Qwen tool emissions stream out as plain prose when
llama-server's structured channel did not pick them up. Both paths
now drain the same way, heal the same way, and re-prompt the same
way -- so a tool call that works on GGUF works identically on
safetensors / MLX.
danielhanchen added a commit to danielhanchen/unsloth-staging-2 that referenced this pull request May 19, 2026
Mirrors the healing-parity follow-up commit from PR unslothai#5620:

1. GGUF BUFFERING tuple now uses the shared 5-format
   TOOL_XML_SIGNALS so non-Qwen tool emissions wake the state
   machine (was only 2 markers).
2. GGUF stream cleanup delegates to the shared strip_tool_markup
   so leaked markup from any family is removed from assistant
   content.
3. GGUF per-tool canonical heal key (python -> code, terminal
   -> command, * -> query) when arguments is a bare string.
4. Safetensors / MLX re-prompt on plan-without-action with
   _MAX_REPROMPTS=3 + extra iteration slots so re-prompts do
   not eat the tool budget.

Also pulls in core/tool_healing.py which staging was missing
(the legacy two-format helper module that llama_cpp.py imports
the regex constants from).

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request significantly expands the tool-call parsing capabilities to support multiple model families, including Llama-3, Mistral, and Gemma 4, while ensuring parity between GGUF and safetensors backends. It introduces a re-prompting mechanism to nudge models that express intent without taking action and implements per-tool canonical heal keys. Feedback highlights potential issues with regex patterns for Mistral and Llama-3 that could lead to incorrect truncation or leaked markup, a bug in string unescaping that corrupts non-ASCII characters, and an order-dependency in sentinel stripping logic.

re.compile(r"<\|tool_call>.*?<tool_call\|>", re.DOTALL),
re.compile(r"\[TOOL_CALLS\]\s*\[.*?\](?:\s*</s>)?", re.DOTALL),
# Mistral v11+ ``[TOOL_CALLS]name{json}`` (may chain), close at ``}``.
re.compile(r"\[TOOL_CALLS\]\s*[\w\.\-]+\s*(?:\[ARGS\])?\s*\{.*?\}", re.DOTALL),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The regex for Mistral v11+ tool calls uses a non-greedy .*? to match the JSON arguments, which will incorrectly truncate the block at the first closing brace }. If the model emits nested JSON (e.g., {"a": {"b": 1}}), the stripping logic will leave trailing garbage in the assistant stream. A similar issue affects the Mistral pre-v11 array pattern on line 55 if the array contains nested brackets.

k = kv.group(1)
if kv.group(2) is not None:
try:
args[k] = bytes(kv.group(2), "utf-8").decode("unicode_escape")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Using unicode_escape on a UTF-8 encoded byte string will corrupt literal non-ASCII characters (e.g., emojis or non-Latin scripts) emitted by the model. For example, a literal (U+2728) becomes â\x9c¨ after this transformation. Since the regex already ensures that internal quotes are escaped, a safer way to unescape the string while preserving literal UTF-8 is to use json.loads on the quoted value.

Suggested change
args[k] = bytes(kv.group(2), "utf-8").decode("unicode_escape")
args[k] = json.loads('"' + kv.group(2) + '"')

Comment on lines +405 to +414
for sentinel in (
"<|begin_of_text|>",
"<|eot_id|>",
"<|start_header_id|>",
"<|end_header_id|>",
"<|eom_id|>",
):
stripped = stripped.lstrip()
if stripped.startswith(sentinel):
stripped = stripped[len(sentinel) :]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The sentinel stripping logic is order-dependent and will fail to remove multiple sentinels if they appear in an order different from the tuple (e.g., <|eot_id|><|begin_of_text|> will leave <|begin_of_text|> behind because the loop already passed it). It should use a loop that repeatedly checks for any of the sentinels at the start of the string.

Suggested change
for sentinel in (
"<|begin_of_text|>",
"<|eot_id|>",
"<|start_header_id|>",
"<|end_header_id|>",
"<|eom_id|>",
):
stripped = stripped.lstrip()
if stripped.startswith(sentinel):
stripped = stripped[len(sentinel) :]
while True:
stripped = stripped.lstrip()
found = False
for sentinel in (
"<|begin_of_text|>",
"<|eot_id|>",
"<|start_header_id|>",
"<|end_header_id|>",
"<|eom_id|>",
):
if stripped.startswith(sentinel):
stripped = stripped[len(sentinel) :]
found = True
break
if not found:
break
References
  1. Combine checks and transformations into a single loop to improve efficiency and handle multiple sentinels correctly.

Comment thread studio/backend/routes/inference.py Outdated
r"<\|tool_call>.*?<tool_call\|>",
r"\[TOOL_CALLS\]\s*\[.*?\](?:\s*</s>)?",
r"\[TOOL_CALLS\]\s*[\w\.\-]+\s*(?:\[ARGS\])?\s*\{.*?\}",
r"<\|python_tag\|>[^\n<]*",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The regex for stripping Llama-3 tags is too restrictive. By stopping at the first newline or < character, it will leak the remainder of the tool call to the user if the call contains a < (common in queries or code) or spans multiple lines. This results in internal markup being visible in the chat UI during streaming.

shimmyshimmer and others added 2 commits May 19, 2026 14:56
Three high-priority gemini findings on the tool-call parsing additions:

  1. unicode_escape on UTF-8 bytes corrupts non-ASCII literals
     (e.g. ✨ becomes â\x9c¨). Replace with json.loads on a quoted
     string -- preserves emoji / CJK / RTL while still handling
     \n \t \uXXXX escapes.

  2. Llama-3 sentinel stripping is order-dependent. A leading
     `<|eot_id|><|begin_of_text|>` left `<|begin_of_text|>` behind
     because the loop had already passed that sentinel. Loop until
     no sentinel matches at the start.

  3. Mistral v11+ `[TOOL_CALLS] name { json }` regex uses non-greedy
     `\{.*?\}` which truncates at the first `}` of a nested JSON
     argument, leaking the tail (e.g. `}}`) into user-visible
     streamed text. Same problem for the v0.3 array pattern with
     nested brackets. Strip those with balanced brace/bracket
     scanning via a new `_strip_mistral_closed_calls` helper called
     from `strip_tool_markup`.

Also fix the inference routes' parallel `_TOOL_XML_RE`:

  - Same nested-JSON truncation in the Mistral patterns; route the
    strip through the parser's balanced-scan helper via a thin
    `_strip_tool_xml` wrapper that all existing callers now use.
  - Llama-3 `<|python_tag|>[^\n<]*` stopped at any `<`, leaking the
    tail of any tool call whose argument contained a literal `<`
    (queries, code snippets). Relax to `[^\n]*` which keeps the
    strip confined to the actual end-of-line.
@danielhanchen

Copy link
Copy Markdown
Member Author

Looks safe to land. Pulled pr-5620 and verified against pr-5615.

What 5620 adds beyond 5615 (parser/inference scope):

File LOC What's new
tool_call_parser.py +137 Balanced brace/bracket Mistral stripping, sentinel-loop, UTF-8-safe KV decoder
safetensors_agentic.py +82 Intent re-prompt loop, _MAX_REPROMPTS=3, BUFFERING fall-through to STREAMING
llama_cpp.py +39 Shared TOOL_XML_SIGNALS + per-tool canonical heal key (parity with safetensors)
routes/inference.py +46 All 9 _TOOL_XML_RE.sub sites switched to _strip_tool_xml()
tests/test_safetensors_tool_loop.py +253 TestLoopRePrompt (6) + TestLoopCanonicalHealKey (3)

Each of the five #5615 regressions has a fix + test in 5620 — details in #5615/#5619.

Tests: pytest studio/backend/tests/test_safetensors_tool_loop.py studio/backend/tests/test_safetensors_capability_advertise.py -q110 passed in 3.86s (had to pip install structlog — missing dev dep, not a code issue).

Worth watching when the validation phase finishes:

  • False-positive tool detection on prose JSON. _parse_llama3_bare_json (tool_call_parser.py:486-547) triggers on any leading { with name + parameters|arguments. The strict parameters/arguments dict guard mitigates this, but a user echoing back {"name":"foo","arguments":{}} in a code fence isn't stripped first — worth a quick fence-aware preamble check.
  • python_tag multi-line code. The full parser (_parse_llama3_python_tag) handles multi-line via _balanced_brace_end, but the routes-level _strip_tool_xml() helper is single-line ([^\n]*). Multi-line python.call(code="…\n…") will have only the first line stripped at the route layer; remaining lines may leak in non-parser paths.
  • Partial Mistral buffer. _strip_mistral_closed_calls correctly leaves unclosed [TOOL_CALLS]… in place, but the BUFFERING state machine caps at _MAX_BUFFER_CHARS=32 (safetensors_agentic.py:44). A long [TOOL_CALLS] very_long_name [ARGS] {… that exceeds 32 chars before reaching { may flush to user. Worth a test that names a tool with a long identifier and asserts no early flush.
  • Re-prompt bound is _MAX_REPROMPTS=3 and _tool_iters_done = iteration + 1 - reprompt_count (safetensors_agentic.py:447), with test_max_reprompts_capped_at_three (test_safetensors_tool_loop.py:893) covering the cap. Looks correct.

Holding the [DRAFT] flag while the end-to-end real-model validation finishes is the right call. Once that's green, this is ready to merge.

Daniel Han and others added 2 commits May 22, 2026 08:18
Earlier revisions of _TOOL_XML_RE in studio.backend.routes.inference
oscillated between two bug shapes:

  5615    r"<\|python_tag\|>[^\n<]*"   -- stopped at any literal "<"
                                         so code='if x < 10: pass'
                                         leaked '< 10: pass)' to the
                                         user.
  5620.1  r"<\|python_tag\|>[^\n]*"    -- single-line only; the second
                                         line of
                                         python.call(code="a\nb")
                                         leaked.

The full parser (_parse_llama3_python_tag) already handles both via
balanced-brace scanning, so the parsing path was fine; the LEAK was
in the streaming strip path that runs on every cumulative emission
while content is still arriving.

Switch to r"<\|python_tag\|>(?:[^<]|<(?!\|))*" so the strip consumes:

  * any character that is not a "<" (newlines, JSON, code, ...),
  * a "<" only when it is NOT followed by "|" (i.e. NOT a Llama-3
    sentinel start like <|eot_id|>, <|eom_id|>, <|begin_of_text|>).

This means:

  * code='if x < 10' stays inside the strip (5615 fix preserved),
  * multi-line code stays inside the strip (5620 round 2),
  * the strip terminates at the next Llama-3 sentinel so trailing
    assistant content survives.

Tests: TestRoutesPythonTagStrip (8 cases)
  pytest test_safetensors_tool_loop.py test_safetensors_capability_advertise.py
    -> 118 passed in 1.81s (was 110).
@danielhanchen

Copy link
Copy Markdown
Member Author

Pushed 7ef4b115 closing the routes-level multi-line python_tag gap flagged earlier. The full parser already handled this via balanced-brace scanning; the leak was only on the streaming strip path that runs while content is still arriving.

Earlier history of _TOOL_XML_RE's python_tag clause:

Revision Pattern Bug
#5615 <|python_tag|>[^\n<]* Stopped at literal < so code="if x < 10" leaked < 10: pass)
#5620 round 1 <|python_tag|>[^\n]* Single-line only — the second line of python.call(code="a\nb") leaked
this commit <|python_tag|>(?:[^<]|<(?!|))* Consumes everything that is NOT a Llama-3 <|sentinel|> start. Multi-line code, embedded JSON, and literal < in code all stay inside the strip.

pytest studio/backend/tests/test_safetensors_tool_loop.py studio/backend/tests/test_safetensors_capability_advertise.py -q118 passed in 1.81s (was 110, +8 new tests in TestRoutesPythonTagStrip covering single-line, less-than-in-code, multi-line, multi-line-with-less-than, sentinel-stop, and JSON-form multi-line cases).

The _MAX_BUFFER_CHARS=32 concern I raised earlier turned out to be unfounded — the BUFFERING state machine treats a fully arrived [TOOL_CALLS] as a match (length 12, well under 32) and transitions to DRAINING, so a long-tool-name [TOOL_CALLS] very_long_name [ARGS] {… never accumulates 32 chars in BUFFERING. The 32-char cap only applies to partial signal prefixes (e.g. <|py waiting to complete to <|python_tag|>), and the longest signal is 14 chars (<|python_tag|>), so the cap is already more than 2× the worst-case prefix length.

The remaining false-positive concern (a model echoing back {"name":"foo","arguments":{}} as the very first non-whitespace characters of an assistant response triggering _parse_llama3_bare_json) is bounded by the strict parameters/arguments dict guard, which keeps plain prose unaffected — happy to add a fence-aware preamble check in a follow-up if you want belt-and-suspenders on that path too.

danielhanchen pushed a commit to danielhanchen/unsloth-staging-2 that referenced this pull request May 22, 2026
Each PR ran the same staged source files before, which went stale when
the upstream PR commits advanced. Refactor to one job per PR with an
actions/checkout of that PR's head ref, so cross-OS validation
always uses the latest commit:

  - PR unslothai#5603 sandbox            -> studio-sandbox-hardening
  - PR unslothai#5620 parser parity      -> studio-tools-multi-format-v2
  - PR unslothai#5696 mtp reload guards  -> followup-mtp-reload-guards (unslothai#5582 followup)
  - PR unslothai#5695 lockfile audit     -> followup-lockfile-audit-regressions (unslothai#5604 followup)

4 jobs x 3 OSes = 12 runs; Windows = 4 (below the 5-concurrent cap).
cancel-in-progress per (workflow, ref) keeps iteration cheap.

All tests stay CPU-only and rely on the CUDA spoof harness in
tests/conftest.py + tests/_zoo_aggressive_cuda_spoof.py, so no real GPU
is required on any runner.
danielhanchen pushed a commit to danielhanchen/unsloth-staging-2 that referenced this pull request May 22, 2026
unslothai#5620 parser tests transitively import the safetensors loop, which
needs the datasets package. unslothai#5696 route-guard tests import
routes/inference.py, which transitively imports core/training (uses
matplotlib). Add both, plus the auth deps (pyjwt/cryptography/
aiosqlite/python-multipart) needed for any test that touches the
FastAPI route module so route-level imports resolve cleanly on all
three OSes.
danielhanchen and others added 4 commits May 27, 2026 12:19
Conflict in studio/backend/routes/inference.py: combine #5620's
multi-format strip (Llama-3 <|python_tag|>, Gemma 4 <|tool_call>...
<tool_call|>, Mistral [TOOL_CALLS] via parser helper) with main's
speculative-buffer leak fixes (orphan open to EOF, bare orphan close,
tail-only </parameter> anchored to \Z so mid-text <parameter> in user
code samples survives).

pytest studio/backend/tests/test_safetensors_tool_loop.py \
       studio/backend/tests/test_safetensors_capability_advertise.py -q
-> 118 passed.
Comments were narrating what the code already says. Cut historical
"earlier revisions used X, then Y" narratives down to one-line WHY
notes where the footgun still matters (canonical heal-key parity,
balanced-brace vs non-greedy regex, ``(?:[^<]|<(?!\|))*`` over
``[^\n<]*``/``[^\n]*``). Drop section-header banners.

No behaviour change. Re-ran:
  pytest studio/backend/tests/test_safetensors_tool_loop.py \
         studio/backend/tests/test_safetensors_capability_advertise.py -q
  -> 118 passed.
Regression replay (parser + _coerce_arguments on the 5 #5615 inputs)
  -> 21/21.
danielhanchen and others added 3 commits May 27, 2026 13:31
Three surgical extensions to the multi-format tool-call parser, each
covering a real fine-tune / template emission shape that the current
parser silently drops. No path narrows; all changes widen what is
accepted.

1. `_parse_tool_call_json` now accepts both `arguments` and
   `parameters` keys. A Hermes / Qwen `<tool_call>{json}</tool_call>`
   wrapper around a Llama-3.2 fine-tune that emits the `parameters`
   key was extracting the tool name and silently discarding the
   args, producing a working-shaped call with an empty payload. The
   bare-JSON and python_tag paths already accepted both keys; this
   path now matches them.

2. `_TC_FUNC_START_RE`, `_TC_PARAM_START_RE`, and `_TC_PARAM_CLOSE_RE`
   now also match the attribute form
   `<function name="..."><param name="...">v</param></function>` used
   by MiniCPM-5 and MiniMax-M2. Names land in either capture group,
   and `</param>` is accepted as a short close.

3. `_parse_llama3_bare_json` sentinel-strip now consumes the role
   label inserted between `<|start_header_id|>` and
   `<|end_header_id|>` by Meta's official Llama-3.x chat template.
   Without this, every assistant turn re-fed through the template
   prefix `<|start_header_id|>assistant<|end_header_id|>\n\n{json}`
   parsed to zero calls, so any history-with-tool-call round-trip
   in production silently dropped.

Tests in `studio/backend/tests/test_safetensors_tool_loop.py`:

* `TestParserRobustness::test_tool_call_json_accepts_parameters_key`
* `TestParserRobustness::test_function_xml_attribute_form`
* `TestParserRobustness::test_function_xml_attribute_form_multi_param`
* `TestParserRobustness::test_function_xml_legacy_equals_form_still_works`
  (regression guard for the existing `<function=name>` syntax)
* `TestParserRobustness::test_llama3_chat_template_round_trip`
* `TestParserRobustness::test_llama3_round_trip_all_roles`
* `TestParserRobustness::test_llama3_round_trip_with_eot_prefix`

`pytest studio/backend/tests/test_safetensors_tool_loop.py
        studio/backend/tests/test_safetensors_capability_advertise.py -q`
goes from 118 to 125 passed.
Two conflicts, both in the tool-call markup-strip plumbing:

* `studio/backend/core/inference/tool_call_parser.py` -- main widened
  `<function=NAME>` and `<parameter=NAME>` from `\w+` to `[\w-]+` so
  MCP tool ids with hyphens (mcp__srv__list-issues, `repo-name`,
  `issue-number`) parse the same as the built-ins. PR-5620 had
  broadened the same patterns to `[\w\.\-]+` (also handles dotted
  module names) and added the python_tag / Mistral / Gemma 4 strip
  families plus the `TOOL_XML_SIGNALS` constant. Resolved by keeping
  PR-5620's full pattern set; `[\w\.\-]+` already accepts the hyphens
  main was widening to, so no MCP coverage is lost.

* `studio/backend/routes/inference.py` -- `_TOOL_XML_RE` regex on
  the route-level scrub path. Same hyphen widening on main vs.
  PR-5620's union of (closed pair / orphan open / Gemma 4 /
  python_tag / `</parameter>` tail-only). Resolved by unioning all
  five patterns and applying main's `[\w-]+` on the `function=` name
  char class so hyphenated MCP names get stripped on the route too.

pytest studio/backend/tests/test_safetensors_tool_loop.py
       studio/backend/tests/test_safetensors_capability_advertise.py -q
-> 125 passed.
rhsCZ pushed a commit to rhsCZ/unsloth that referenced this pull request May 27, 2026
Four fixes addressing review of the parent commit:

1. GLM <arg_value> coercion: tighten the
   json.loads -> ast.literal_eval -> raw cascade to only deserialize
   when the body unambiguously looks like a JSON literal (object,
   array, JSON-encoded string, true/false/null, or numeric). Strings
   like ``True`` / ``None`` (Python literals, not JSON) and arbitrary
   prose now stay raw. The bare-numeric / bare-boolean ambiguity with
   string args remains an inherent limitation of the template without
   schema access -- documented in the new comment. Drops the ast
   import entirely (closes Gemini's :1036 suggestion).

2. Kimi K2 bare-counter ids (e.g. ``<|tool_call_begin|>3``) are now
   dropped rather than surfaced as a tool literally named "3". Matches
   vLLM behaviour; SGLang's schema-infer fallback is out of scope at
   the parse site. Real Kimi K2 emissions use ``functions.NAME:IDX``
   so this is the exception path.

3. Restore the elaborate ``<|python_tag|>(?:[^<]|<(?!\|))*`` clause in
   routes.inference._TOOL_XML_RE -- the simpler ``[^\n<]*`` form
   regressed PR unslothai#5620's multi-line / literal-``<`` python_tag fix.
   Restore ``TestRoutesPythonTagStrip`` (8 tests) adapted to call
   ``_TOOL_XML_RE.sub`` directly since the ``_strip_tool_xml`` helper
   was inlined this PR.

4. Add the spaced and backslash-escaped DeepSeek opener variants
   (``<|tool calls begin|>``, ``<|tool\_calls\_begin|>``) to
   ``TOOL_XML_SIGNALS`` for streaming-gate parity with
   ``_DEEPSEEK_BEGIN_RE``.

Also updates the llama.cpp / vLLM citations in the parser docstrings:
``common/chat-parser.cpp`` was split into ``common/chat.cpp`` +
``common/chat-peg-parser.cpp`` by llama.cpp PR #18675, and vLLM
moved the tool parsers from ``vllm/entrypoints/openai/tool_parsers/``
to ``vllm/tool_parsers/``. Pin to pre-refactor commit ``51fa458a92d6``
where the cited line numbers still resolve.

New regression tests in ``test_pr5624_regressions.py`` cover the GLM
coercion heuristic shapes, GLM literal-``<`` in arg_value, Kimi K2
dotted name, Kimi K2 bare-counter drop, DeepSeek V3.1 truncated
mid-stream, and routes-layer strip across all three new families.

Tests:
  pytest studio/backend/tests/test_safetensors_tool_loop.py
         studio/backend/tests/test_safetensors_capability_advertise.py
         studio/backend/tests/test_pr5624_regressions.py -q
  -> 170 passed in 1.91s
danielhanchen and others added 2 commits May 27, 2026 14:23
…all>

`_parse_function_xml` was looking for `</tool_call>` (the Hermes
wrapper) as the body terminator. When a model emits a standalone
`<function=NAME><parameter=K>v</parameter></function>` followed by
explanatory prose (which models routinely do), no `</tool_call>` is
present, so the body extended to end-of-string and the trailing
prose leaked into the LAST parameter value.

Pre-existing on main (the legacy `<function=NAME>` form had this
bug too). Same affects PR #5620's new attribute-form
`<function name="NAME"><param name="K">v</param></function>`
emission used by MiniCPM-5 / MiniMax-M2.

Fix: `_TC_END_TAG_RE` now matches either `</tool_call>` OR
`</function>`. The existing `_TC_FUNC_CLOSE_RE` / `_TC_PARAM_CLOSE_RE`
strips are unchanged. Multi-call inputs still bound each function
at the next `<function=` start, so no over-eager consumption.

New tests:

* `test_function_xml_followed_by_prose` (legacy form + prose)
* `test_function_attribute_xml_followed_by_prose` (attribute form + prose)

Existing `test_code_with_embedded_xml` still passes (a parameter
value containing literal `<a></a>` is preserved because the
embedded close tag is `</a>`, not `</function>`).

`pytest studio/backend/tests/test_safetensors_tool_loop.py
        studio/backend/tests/test_safetensors_capability_advertise.py -q`
goes from 125 to 127 passed.
A fuzz pass on PR #5811 turned up that ``_parse_llama3_bare_json``
accepted ``parameters`` as a string, contradicting the docstring's
"parameters or arguments is a dict" guard. Prose JSON like
``{"name":"foo","parameters":"a sentence"}`` would wrongly fire the
parser, which the agentic loop would then heal into a real
``foo(query="a sentence")`` call.

Same code lives on this branch, so the same fix applies here.

Tightened guard:

  - ``parameters`` must be a dict (Llama-3 spec).
  - ``arguments`` may be a dict, or a JSON-encoded string that
    decodes to a dict (OpenAI shape, e.g.
    ``"arguments":"{\"q\":\"x\"}"``). Plain non-JSON strings or
    JSON-strings of lists / scalars / null no longer pass.

Mirrors the fix landed in PR #5811 commit 615b860. Adds the same
4 regression tests under TestParserMultiFormat.

Existing test suite stays green: 127 -> 131 passing.
rhsCZ pushed a commit to rhsCZ/unsloth that referenced this pull request May 27, 2026
The JSON sub-path of ``_parse_llama3_python_tag`` was fabricating
``{"value": args}`` when the model emitted a non-dict / non-string
``arguments`` value (e.g. ``42``, ``[1,2,3]``, ``null``, ``true``).
This silently turned a malformed emission into a real tool call,
which the agentic loop would then execute with arguments the model
never intended.

Tightened: skip the call instead of fabricating. The same
behaviour now matches the bare-JSON guard tightened earlier
(strict-guard merge from PR unslothai#5620, inherited via merge here).

Added a regression test covering the four non-scalar shapes.
Pass count on this branch: 158 -> 159.

Sites in ``_parse_tool_call_json`` and ``_consume_mistral_call``
keep the existing looser behaviour for now; both are reached
only after explicit ``<tool_call>`` / ``[TOOL_CALLS]`` markers
so the false-positive surface there is much narrower.
danielhanchen and others added 2 commits May 31, 2026 13:17
…ALL_ID / THINK, attribute-form signal)

Three GGUF-parity fixes to the safetensors tool-call parser, each matching
llama.cpp's reference behaviour:

- Mistral Small 3.2 emits [TOOL_CALLS]name[CALL_ID]<id>[ARGS]{json}. The
  parser stopped after the name on seeing [CALL_ID] (neither [ARGS] nor {),
  dropping the call. Skip an optional [CALL_ID]<id> segment in both the
  parse and strip paths. llama.cpp parses this (test-chat.cpp:4785).

- Magistral wraps reasoning in [THINK]...[/THINK]. A [TOOL_CALLS] inside the
  reasoning was parsed as a real call, producing a phantom call. Strip a
  leading [THINK] block before scanning so only the post-reasoning call
  counts (test-chat.cpp:2285); a literal [THINK] inside a later argument is
  left intact.

- The standalone MiniCPM-5 / MiniMax-M2 <function name="..."> attribute form
  parsed correctly but was absent from TOOL_XML_SIGNALS and the markup strip
  patterns, so the streaming safety-net parse was gated off (dropping the
  call) and markup leaked into displayed text. Add the signal and broaden
  the strip regexes.

Adds regression tests for all three.
The agentic loop's streaming safety-net parse was gated on
has_tool_signal(), which is False for the Llama-3.1 / 3.2 bare-JSON tool
form {"name":..,"parameters":..} (no XML marker). Real tool calls were
therefore dropped: the loop logged "model planned without calling tools",
re-prompted three times, then gave up with zero tool calls, while GGUF's
llama-server parses the same emission natively.

Run parse_tool_calls_from_text() unconditionally in the safety net. The
parser is strict (only fires on a valid tool-call shape) so plain answers
are unaffected. Reproduced on a real unsloth/Llama-3.1-8B-Instruct run:
the model emits {"name":"web_search","parameters":{...}} which now
executes the tool instead of being re-prompted into a no-op.

Adds a loop regression test for the bare-JSON form.
@danielhanchen

Copy link
Copy Markdown
Member Author

Confirmed the Llama-3 side of this on a B200 against the matching unsloth UD-Q4_K_XL GGUF. Llama-3.1-8B safetensors fires tools on 18/18 runs (6 prompts, 3 seeds, same generation params as the GGUF), matching the GGUF 18/18 and passing 5/6 of the deterministic code checks vs 2/6 for the GGUF. The bare-JSON form {"name": ..., "parameters": ...} is the key case, since the skip-special-token stream drops <|python_tag|> and only the multi-format parser recovers it.

Mistral-Small-3.2 refuses to call tools in both backends for these prompts, so that one is model behaviour rather than the parser. Full per-family numbers and the cross-checks against llama.cpp, vLLM, and SGLang are on the stacked PR #5624.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants