studio: tool calling + healing parity for Llama-3, Mistral, Gemma 4 on safetensors + MLX#5620
studio: tool calling + healing parity for Llama-3, Mistral, Gemma 4 on safetensors + MLX#5620danielhanchen wants to merge 20 commits into
Conversation
…LX (#5615) Adds tool calling for Llama-3, Mistral (pre-v11 + v11+ + [ARGS]), and Gemma 4 to the safetensors / transformers and MLX backends. Parser patched against llama.cpp / vLLM / SGLang per-family parsers and normalises to OpenAI shape. 96 targeted unit tests + cross-OS staging CI (ubuntu / macos-14 / windows) green on the multi-format probe.
After the multi-format parser landed in #5615, the safetensors / MLX agentic loop and the GGUF loop still differed on healing behaviour. This commit closes the gaps in both directions so the two backends react the same way to identical model output. Changes: 1. core/inference/llama_cpp.py -- the GGUF BUFFERING state machine now wakes on every emission marker the shared parser knows. Was ("<tool_call>", "<function="); is now the five-tuple imported from core.inference.tool_call_parser (Qwen / Qwen3.5 / Llama-3 <|python_tag|> / Mistral [TOOL_CALLS] / Gemma 4 <|tool_call>). Stream cleanup is delegated to the same shared strip_tool_markup so leaked markup from any family is removed from assistant content. 2. core/inference/llama_cpp.py -- per-tool canonical heal key. When a tool arguments field is a bare string and JSON parsing fails, the GGUF path now heals to {"code": raw_args} for python, {"command": raw_args} for terminal, and {"query": raw_args} for everything else. Was hard-coded to {"query": raw_args}, which silently routed every python / terminal emission through web_search. Mirrors safetensors_agentic._CANONICAL_HEAL_ARG. 3. core/inference/safetensors_agentic.py -- re-prompt on plan- without-action. When the model emits a short forward-looking intent ("I'll search for that", "Let me check", "First, I will...") and no tool call, the loop nudges the model to act instead of silently returning a plan-only answer. Up to _MAX_REPROMPTS=3 (matches GGUF). The intent regex, character cap, and instruction text are byte-identical to the GGUF path. The buffer-end fall-through is unified so a buffered intent emission that never exits the BUFFERING state still triggers the re-prompt. 4. core/inference/safetensors_agentic.py -- extra iteration slots for re-prompts. The loop now budgets max_tool_iterations + _MAX_REPROMPTS + 1 total iterations and tracks the tool-call count separately, so a stalling model can be nudged 3x without eating the caller's tool-call budget. Mirrors the _extra slot reservation in the GGUF path. Tests (14 new safetensors-side units; 5 GGUF parity pins): TestLoopRePrompt -- intent-trigger, plain-answer, no-tools, cap-at-three, budget preserved, buffer-end intent. TestLoopCanonicalHealKey -- python / terminal / unknown. TestGGUFSafetensorsHealingParity -- shared markers used, shared strip used, canonical heal keys identical, intent regex matches same phrases, _MAX_REPROMPTS equal on both backends. All 110 targeted tests pass locally; the broader tool / inference / model-config / sandbox / anthropic / mlx suites stay green. Why this matters Without this parity, Llama-3.2 / Mistral / Gemma 4 emissions on Mac (MLX) and Linux-safetensors stop the agentic loop as soon as the model says "Let me...", because the GGUF re-prompt logic never existed on these backends. The two-marker GGUF BUFFERING tuple also let non-Qwen tool emissions stream out as plain prose when llama-server's structured channel did not pick them up. Both paths now drain the same way, heal the same way, and re-prompt the same way -- so a tool call that works on GGUF works identically on safetensors / MLX.
for more information, see https://pre-commit.ci
Mirrors the healing-parity follow-up commit from PR unslothai#5620: 1. GGUF BUFFERING tuple now uses the shared 5-format TOOL_XML_SIGNALS so non-Qwen tool emissions wake the state machine (was only 2 markers). 2. GGUF stream cleanup delegates to the shared strip_tool_markup so leaked markup from any family is removed from assistant content. 3. GGUF per-tool canonical heal key (python -> code, terminal -> command, * -> query) when arguments is a bare string. 4. Safetensors / MLX re-prompt on plan-without-action with _MAX_REPROMPTS=3 + extra iteration slots so re-prompts do not eat the tool budget. Also pulls in core/tool_healing.py which staging was missing (the legacy two-format helper module that llama_cpp.py imports the regex constants from).
There was a problem hiding this comment.
Code Review
This pull request significantly expands the tool-call parsing capabilities to support multiple model families, including Llama-3, Mistral, and Gemma 4, while ensuring parity between GGUF and safetensors backends. It introduces a re-prompting mechanism to nudge models that express intent without taking action and implements per-tool canonical heal keys. Feedback highlights potential issues with regex patterns for Mistral and Llama-3 that could lead to incorrect truncation or leaked markup, a bug in string unescaping that corrupts non-ASCII characters, and an order-dependency in sentinel stripping logic.
| re.compile(r"<\|tool_call>.*?<tool_call\|>", re.DOTALL), | ||
| re.compile(r"\[TOOL_CALLS\]\s*\[.*?\](?:\s*</s>)?", re.DOTALL), | ||
| # Mistral v11+ ``[TOOL_CALLS]name{json}`` (may chain), close at ``}``. | ||
| re.compile(r"\[TOOL_CALLS\]\s*[\w\.\-]+\s*(?:\[ARGS\])?\s*\{.*?\}", re.DOTALL), |
There was a problem hiding this comment.
The regex for Mistral v11+ tool calls uses a non-greedy .*? to match the JSON arguments, which will incorrectly truncate the block at the first closing brace }. If the model emits nested JSON (e.g., {"a": {"b": 1}}), the stripping logic will leave trailing garbage in the assistant stream. A similar issue affects the Mistral pre-v11 array pattern on line 55 if the array contains nested brackets.
| k = kv.group(1) | ||
| if kv.group(2) is not None: | ||
| try: | ||
| args[k] = bytes(kv.group(2), "utf-8").decode("unicode_escape") |
There was a problem hiding this comment.
Using unicode_escape on a UTF-8 encoded byte string will corrupt literal non-ASCII characters (e.g., emojis or non-Latin scripts) emitted by the model. For example, a literal ✨ (U+2728) becomes â\x9c¨ after this transformation. Since the regex already ensures that internal quotes are escaped, a safer way to unescape the string while preserving literal UTF-8 is to use json.loads on the quoted value.
| args[k] = bytes(kv.group(2), "utf-8").decode("unicode_escape") | |
| args[k] = json.loads('"' + kv.group(2) + '"') |
| for sentinel in ( | ||
| "<|begin_of_text|>", | ||
| "<|eot_id|>", | ||
| "<|start_header_id|>", | ||
| "<|end_header_id|>", | ||
| "<|eom_id|>", | ||
| ): | ||
| stripped = stripped.lstrip() | ||
| if stripped.startswith(sentinel): | ||
| stripped = stripped[len(sentinel) :] |
There was a problem hiding this comment.
The sentinel stripping logic is order-dependent and will fail to remove multiple sentinels if they appear in an order different from the tuple (e.g., <|eot_id|><|begin_of_text|> will leave <|begin_of_text|> behind because the loop already passed it). It should use a loop that repeatedly checks for any of the sentinels at the start of the string.
| for sentinel in ( | |
| "<|begin_of_text|>", | |
| "<|eot_id|>", | |
| "<|start_header_id|>", | |
| "<|end_header_id|>", | |
| "<|eom_id|>", | |
| ): | |
| stripped = stripped.lstrip() | |
| if stripped.startswith(sentinel): | |
| stripped = stripped[len(sentinel) :] | |
| while True: | |
| stripped = stripped.lstrip() | |
| found = False | |
| for sentinel in ( | |
| "<|begin_of_text|>", | |
| "<|eot_id|>", | |
| "<|start_header_id|>", | |
| "<|end_header_id|>", | |
| "<|eom_id|>", | |
| ): | |
| if stripped.startswith(sentinel): | |
| stripped = stripped[len(sentinel) :] | |
| found = True | |
| break | |
| if not found: | |
| break |
References
- Combine checks and transformations into a single loop to improve efficiency and handle multiple sentinels correctly.
| r"<\|tool_call>.*?<tool_call\|>", | ||
| r"\[TOOL_CALLS\]\s*\[.*?\](?:\s*</s>)?", | ||
| r"\[TOOL_CALLS\]\s*[\w\.\-]+\s*(?:\[ARGS\])?\s*\{.*?\}", | ||
| r"<\|python_tag\|>[^\n<]*", |
There was a problem hiding this comment.
The regex for stripping Llama-3 tags is too restrictive. By stopping at the first newline or < character, it will leak the remainder of the tool call to the user if the call contains a < (common in queries or code) or spans multiple lines. This results in internal markup being visible in the chat UI during streaming.
Three high-priority gemini findings on the tool-call parsing additions:
1. unicode_escape on UTF-8 bytes corrupts non-ASCII literals
(e.g. ✨ becomes â\x9c¨). Replace with json.loads on a quoted
string -- preserves emoji / CJK / RTL while still handling
\n \t \uXXXX escapes.
2. Llama-3 sentinel stripping is order-dependent. A leading
`<|eot_id|><|begin_of_text|>` left `<|begin_of_text|>` behind
because the loop had already passed that sentinel. Loop until
no sentinel matches at the start.
3. Mistral v11+ `[TOOL_CALLS] name { json }` regex uses non-greedy
`\{.*?\}` which truncates at the first `}` of a nested JSON
argument, leaking the tail (e.g. `}}`) into user-visible
streamed text. Same problem for the v0.3 array pattern with
nested brackets. Strip those with balanced brace/bracket
scanning via a new `_strip_mistral_closed_calls` helper called
from `strip_tool_markup`.
Also fix the inference routes' parallel `_TOOL_XML_RE`:
- Same nested-JSON truncation in the Mistral patterns; route the
strip through the parser's balanced-scan helper via a thin
`_strip_tool_xml` wrapper that all existing callers now use.
- Llama-3 `<|python_tag|>[^\n<]*` stopped at any `<`, leaking the
tail of any tool call whose argument contained a literal `<`
(queries, code snippets). Relax to `[^\n]*` which keeps the
strip confined to the actual end-of-line.
for more information, see https://pre-commit.ci
|
Looks safe to land. Pulled What 5620 adds beyond 5615 (parser/inference scope):
Each of the five #5615 regressions has a fix + test in 5620 — details in #5615/#5619. Tests: Worth watching when the validation phase finishes:
Holding the |
Earlier revisions of _TOOL_XML_RE in studio.backend.routes.inference
oscillated between two bug shapes:
5615 r"<\|python_tag\|>[^\n<]*" -- stopped at any literal "<"
so code='if x < 10: pass'
leaked '< 10: pass)' to the
user.
5620.1 r"<\|python_tag\|>[^\n]*" -- single-line only; the second
line of
python.call(code="a\nb")
leaked.
The full parser (_parse_llama3_python_tag) already handles both via
balanced-brace scanning, so the parsing path was fine; the LEAK was
in the streaming strip path that runs on every cumulative emission
while content is still arriving.
Switch to r"<\|python_tag\|>(?:[^<]|<(?!\|))*" so the strip consumes:
* any character that is not a "<" (newlines, JSON, code, ...),
* a "<" only when it is NOT followed by "|" (i.e. NOT a Llama-3
sentinel start like <|eot_id|>, <|eom_id|>, <|begin_of_text|>).
This means:
* code='if x < 10' stays inside the strip (5615 fix preserved),
* multi-line code stays inside the strip (5620 round 2),
* the strip terminates at the next Llama-3 sentinel so trailing
assistant content survives.
Tests: TestRoutesPythonTagStrip (8 cases)
pytest test_safetensors_tool_loop.py test_safetensors_capability_advertise.py
-> 118 passed in 1.81s (was 110).
for more information, see https://pre-commit.ci
|
Pushed Earlier history of
The The remaining false-positive concern (a model echoing back |
Each PR ran the same staged source files before, which went stale when the upstream PR commits advanced. Refactor to one job per PR with an actions/checkout of that PR's head ref, so cross-OS validation always uses the latest commit: - PR unslothai#5603 sandbox -> studio-sandbox-hardening - PR unslothai#5620 parser parity -> studio-tools-multi-format-v2 - PR unslothai#5696 mtp reload guards -> followup-mtp-reload-guards (unslothai#5582 followup) - PR unslothai#5695 lockfile audit -> followup-lockfile-audit-regressions (unslothai#5604 followup) 4 jobs x 3 OSes = 12 runs; Windows = 4 (below the 5-concurrent cap). cancel-in-progress per (workflow, ref) keeps iteration cheap. All tests stay CPU-only and rely on the CUDA spoof harness in tests/conftest.py + tests/_zoo_aggressive_cuda_spoof.py, so no real GPU is required on any runner.
unslothai#5620 parser tests transitively import the safetensors loop, which needs the datasets package. unslothai#5696 route-guard tests import routes/inference.py, which transitively imports core/training (uses matplotlib). Add both, plus the auth deps (pyjwt/cryptography/ aiosqlite/python-multipart) needed for any test that touches the FastAPI route module so route-level imports resolve cleanly on all three OSes.
Conflict in studio/backend/routes/inference.py: combine #5620's multi-format strip (Llama-3 <|python_tag|>, Gemma 4 <|tool_call>... <tool_call|>, Mistral [TOOL_CALLS] via parser helper) with main's speculative-buffer leak fixes (orphan open to EOF, bare orphan close, tail-only </parameter> anchored to \Z so mid-text <parameter> in user code samples survives). pytest studio/backend/tests/test_safetensors_tool_loop.py \ studio/backend/tests/test_safetensors_capability_advertise.py -q -> 118 passed.
for more information, see https://pre-commit.ci
Comments were narrating what the code already says. Cut historical
"earlier revisions used X, then Y" narratives down to one-line WHY
notes where the footgun still matters (canonical heal-key parity,
balanced-brace vs non-greedy regex, ``(?:[^<]|<(?!\|))*`` over
``[^\n<]*``/``[^\n]*``). Drop section-header banners.
No behaviour change. Re-ran:
pytest studio/backend/tests/test_safetensors_tool_loop.py \
studio/backend/tests/test_safetensors_capability_advertise.py -q
-> 118 passed.
Regression replay (parser + _coerce_arguments on the 5 #5615 inputs)
-> 21/21.
for more information, see https://pre-commit.ci
Three surgical extensions to the multi-format tool-call parser, each
covering a real fine-tune / template emission shape that the current
parser silently drops. No path narrows; all changes widen what is
accepted.
1. `_parse_tool_call_json` now accepts both `arguments` and
`parameters` keys. A Hermes / Qwen `<tool_call>{json}</tool_call>`
wrapper around a Llama-3.2 fine-tune that emits the `parameters`
key was extracting the tool name and silently discarding the
args, producing a working-shaped call with an empty payload. The
bare-JSON and python_tag paths already accepted both keys; this
path now matches them.
2. `_TC_FUNC_START_RE`, `_TC_PARAM_START_RE`, and `_TC_PARAM_CLOSE_RE`
now also match the attribute form
`<function name="..."><param name="...">v</param></function>` used
by MiniCPM-5 and MiniMax-M2. Names land in either capture group,
and `</param>` is accepted as a short close.
3. `_parse_llama3_bare_json` sentinel-strip now consumes the role
label inserted between `<|start_header_id|>` and
`<|end_header_id|>` by Meta's official Llama-3.x chat template.
Without this, every assistant turn re-fed through the template
prefix `<|start_header_id|>assistant<|end_header_id|>\n\n{json}`
parsed to zero calls, so any history-with-tool-call round-trip
in production silently dropped.
Tests in `studio/backend/tests/test_safetensors_tool_loop.py`:
* `TestParserRobustness::test_tool_call_json_accepts_parameters_key`
* `TestParserRobustness::test_function_xml_attribute_form`
* `TestParserRobustness::test_function_xml_attribute_form_multi_param`
* `TestParserRobustness::test_function_xml_legacy_equals_form_still_works`
(regression guard for the existing `<function=name>` syntax)
* `TestParserRobustness::test_llama3_chat_template_round_trip`
* `TestParserRobustness::test_llama3_round_trip_all_roles`
* `TestParserRobustness::test_llama3_round_trip_with_eot_prefix`
`pytest studio/backend/tests/test_safetensors_tool_loop.py
studio/backend/tests/test_safetensors_capability_advertise.py -q`
goes from 118 to 125 passed.
for more information, see https://pre-commit.ci
Two conflicts, both in the tool-call markup-strip plumbing:
* `studio/backend/core/inference/tool_call_parser.py` -- main widened
`<function=NAME>` and `<parameter=NAME>` from `\w+` to `[\w-]+` so
MCP tool ids with hyphens (mcp__srv__list-issues, `repo-name`,
`issue-number`) parse the same as the built-ins. PR-5620 had
broadened the same patterns to `[\w\.\-]+` (also handles dotted
module names) and added the python_tag / Mistral / Gemma 4 strip
families plus the `TOOL_XML_SIGNALS` constant. Resolved by keeping
PR-5620's full pattern set; `[\w\.\-]+` already accepts the hyphens
main was widening to, so no MCP coverage is lost.
* `studio/backend/routes/inference.py` -- `_TOOL_XML_RE` regex on
the route-level scrub path. Same hyphen widening on main vs.
PR-5620's union of (closed pair / orphan open / Gemma 4 /
python_tag / `</parameter>` tail-only). Resolved by unioning all
five patterns and applying main's `[\w-]+` on the `function=` name
char class so hyphenated MCP names get stripped on the route too.
pytest studio/backend/tests/test_safetensors_tool_loop.py
studio/backend/tests/test_safetensors_capability_advertise.py -q
-> 125 passed.
Four fixes addressing review of the parent commit: 1. GLM <arg_value> coercion: tighten the json.loads -> ast.literal_eval -> raw cascade to only deserialize when the body unambiguously looks like a JSON literal (object, array, JSON-encoded string, true/false/null, or numeric). Strings like ``True`` / ``None`` (Python literals, not JSON) and arbitrary prose now stay raw. The bare-numeric / bare-boolean ambiguity with string args remains an inherent limitation of the template without schema access -- documented in the new comment. Drops the ast import entirely (closes Gemini's :1036 suggestion). 2. Kimi K2 bare-counter ids (e.g. ``<|tool_call_begin|>3``) are now dropped rather than surfaced as a tool literally named "3". Matches vLLM behaviour; SGLang's schema-infer fallback is out of scope at the parse site. Real Kimi K2 emissions use ``functions.NAME:IDX`` so this is the exception path. 3. Restore the elaborate ``<|python_tag|>(?:[^<]|<(?!\|))*`` clause in routes.inference._TOOL_XML_RE -- the simpler ``[^\n<]*`` form regressed PR unslothai#5620's multi-line / literal-``<`` python_tag fix. Restore ``TestRoutesPythonTagStrip`` (8 tests) adapted to call ``_TOOL_XML_RE.sub`` directly since the ``_strip_tool_xml`` helper was inlined this PR. 4. Add the spaced and backslash-escaped DeepSeek opener variants (``<|tool calls begin|>``, ``<|tool\_calls\_begin|>``) to ``TOOL_XML_SIGNALS`` for streaming-gate parity with ``_DEEPSEEK_BEGIN_RE``. Also updates the llama.cpp / vLLM citations in the parser docstrings: ``common/chat-parser.cpp`` was split into ``common/chat.cpp`` + ``common/chat-peg-parser.cpp`` by llama.cpp PR #18675, and vLLM moved the tool parsers from ``vllm/entrypoints/openai/tool_parsers/`` to ``vllm/tool_parsers/``. Pin to pre-refactor commit ``51fa458a92d6`` where the cited line numbers still resolve. New regression tests in ``test_pr5624_regressions.py`` cover the GLM coercion heuristic shapes, GLM literal-``<`` in arg_value, Kimi K2 dotted name, Kimi K2 bare-counter drop, DeepSeek V3.1 truncated mid-stream, and routes-layer strip across all three new families. Tests: pytest studio/backend/tests/test_safetensors_tool_loop.py studio/backend/tests/test_safetensors_capability_advertise.py studio/backend/tests/test_pr5624_regressions.py -q -> 170 passed in 1.91s
…all> `_parse_function_xml` was looking for `</tool_call>` (the Hermes wrapper) as the body terminator. When a model emits a standalone `<function=NAME><parameter=K>v</parameter></function>` followed by explanatory prose (which models routinely do), no `</tool_call>` is present, so the body extended to end-of-string and the trailing prose leaked into the LAST parameter value. Pre-existing on main (the legacy `<function=NAME>` form had this bug too). Same affects PR #5620's new attribute-form `<function name="NAME"><param name="K">v</param></function>` emission used by MiniCPM-5 / MiniMax-M2. Fix: `_TC_END_TAG_RE` now matches either `</tool_call>` OR `</function>`. The existing `_TC_FUNC_CLOSE_RE` / `_TC_PARAM_CLOSE_RE` strips are unchanged. Multi-call inputs still bound each function at the next `<function=` start, so no over-eager consumption. New tests: * `test_function_xml_followed_by_prose` (legacy form + prose) * `test_function_attribute_xml_followed_by_prose` (attribute form + prose) Existing `test_code_with_embedded_xml` still passes (a parameter value containing literal `<a></a>` is preserved because the embedded close tag is `</a>`, not `</function>`). `pytest studio/backend/tests/test_safetensors_tool_loop.py studio/backend/tests/test_safetensors_capability_advertise.py -q` goes from 125 to 127 passed.
A fuzz pass on PR #5811 turned up that ``_parse_llama3_bare_json`` accepted ``parameters`` as a string, contradicting the docstring's "parameters or arguments is a dict" guard. Prose JSON like ``{"name":"foo","parameters":"a sentence"}`` would wrongly fire the parser, which the agentic loop would then heal into a real ``foo(query="a sentence")`` call. Same code lives on this branch, so the same fix applies here. Tightened guard: - ``parameters`` must be a dict (Llama-3 spec). - ``arguments`` may be a dict, or a JSON-encoded string that decodes to a dict (OpenAI shape, e.g. ``"arguments":"{\"q\":\"x\"}"``). Plain non-JSON strings or JSON-strings of lists / scalars / null no longer pass. Mirrors the fix landed in PR #5811 commit 615b860. Adds the same 4 regression tests under TestParserMultiFormat. Existing test suite stays green: 127 -> 131 passing.
The JSON sub-path of ``_parse_llama3_python_tag`` was fabricating
``{"value": args}`` when the model emitted a non-dict / non-string
``arguments`` value (e.g. ``42``, ``[1,2,3]``, ``null``, ``true``).
This silently turned a malformed emission into a real tool call,
which the agentic loop would then execute with arguments the model
never intended.
Tightened: skip the call instead of fabricating. The same
behaviour now matches the bare-JSON guard tightened earlier
(strict-guard merge from PR unslothai#5620, inherited via merge here).
Added a regression test covering the four non-scalar shapes.
Pass count on this branch: 158 -> 159.
Sites in ``_parse_tool_call_json`` and ``_consume_mistral_call``
keep the existing looser behaviour for now; both are reached
only after explicit ``<tool_call>`` / ``[TOOL_CALLS]`` markers
so the false-positive surface there is much narrower.
…ALL_ID / THINK, attribute-form signal)
Three GGUF-parity fixes to the safetensors tool-call parser, each matching
llama.cpp's reference behaviour:
- Mistral Small 3.2 emits [TOOL_CALLS]name[CALL_ID]<id>[ARGS]{json}. The
parser stopped after the name on seeing [CALL_ID] (neither [ARGS] nor {),
dropping the call. Skip an optional [CALL_ID]<id> segment in both the
parse and strip paths. llama.cpp parses this (test-chat.cpp:4785).
- Magistral wraps reasoning in [THINK]...[/THINK]. A [TOOL_CALLS] inside the
reasoning was parsed as a real call, producing a phantom call. Strip a
leading [THINK] block before scanning so only the post-reasoning call
counts (test-chat.cpp:2285); a literal [THINK] inside a later argument is
left intact.
- The standalone MiniCPM-5 / MiniMax-M2 <function name="..."> attribute form
parsed correctly but was absent from TOOL_XML_SIGNALS and the markup strip
patterns, so the streaming safety-net parse was gated off (dropping the
call) and markup leaked into displayed text. Add the signal and broaden
the strip regexes.
Adds regression tests for all three.
for more information, see https://pre-commit.ci
The agentic loop's streaming safety-net parse was gated on
has_tool_signal(), which is False for the Llama-3.1 / 3.2 bare-JSON tool
form {"name":..,"parameters":..} (no XML marker). Real tool calls were
therefore dropped: the loop logged "model planned without calling tools",
re-prompted three times, then gave up with zero tool calls, while GGUF's
llama-server parses the same emission natively.
Run parse_tool_calls_from_text() unconditionally in the safety net. The
parser is strict (only fires on a valid tool-call shape) so plain answers
are unaffected. Reproduced on a real unsloth/Llama-3.1-8B-Instruct run:
the model emits {"name":"web_search","parameters":{...}} which now
executes the tool instead of being re-prompted into a no-op.
Adds a loop regression test for the bare-JSON form.
…to studio-tools-multi-format-v2
|
Confirmed the Llama-3 side of this on a B200 against the matching Mistral-Small-3.2 refuses to call tools in both backends for these prompts, so that one is model behaviour rather than the parser. Full per-family numbers and the cross-checks against llama.cpp, vLLM, and SGLang are on the stacked PR #5624. |
DRAFT - holding open while we validate end to end with real models. Do not merge until safetensors + MLX behaviour is verified for every covered family.