Studio: tools, thinking blocks, code execution and web search for safetensors #5520
The GGUF/llama-server backend already streams tool_start/tool_end events,
strips <tool_call> XML, parses <think> blocks, and runs an agentic loop
through web_search / python / terminal. The transformers/safetensors
backend was inference-only: no tools, no template-level reasoning
controls, no agentic loop. This change brings safetensors to parity for
non-vision text chat while leaving the GGUF path untouched.
Backend changes:
core/inference/tool_call_parser.py (new): backend-neutral
parse_tool_calls_from_text, strip_tool_markup, has_tool_signal, and
shared regex/strip patterns. LlamaCppBackend._parse_tool_calls_from_text
delegates here, so both paths fix-forward together.
core/inference/safetensors_agentic.py (new): cumulative-text agentic
loop with a 3-state buffer (BUFFERING, STREAMING, DRAINING). Yields
the same status / content / tool_start / tool_end / metadata events
the GGUF path already emits. Handles duplicate-call short-circuit,
__IMAGES__ sentinel stripping before model feedback, error-prefix
tagging, cancel_event, and max_tool_iterations capping.
core/inference/inference.py: generate_chat_response now accepts
tools / enable_thinking / reasoning_effort / preserve_thinking;
_apply_chat_template_for_generation peels unsupported kwargs off the
template call in safe order (richest first). New
generate_chat_completion_with_tools method wraps the agentic loop.
core/inference/orchestrator.py: forwards the new kwargs through IPC
(gen + dispatched paths); adds generate_chat_completion_with_tools
that drives the loop from the parent process.
core/inference/worker.py: pulls tools/enable_thinking/reasoning_effort/
preserve_thinking from the cmd dict when present and forwards to
backend.generate_chat_response.
routes/inference.py: shared _detect_safetensors_features helper that
calls detect_reasoning_flags on the loaded tokenizer template so the
load/already_loaded/status endpoints all advertise the same flags
GGUF does. New safetensors tool-calling SSE branch in
POST /chat/completions that mirrors the GGUF flow (system prompt
nudge, tool subset filtering, stale-XML scrubbing of prior
assistant turns). gpt-oss is gated out of the safetensors tool path
because Harmony uses a dedicated channel for tool calls rather than
<tool_call> XML; GGUF still serves that case.
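As a rough illustration of what a backend-neutral parser for closed `<tool_call>` JSON blocks might look like, here is a minimal sketch. The function names `parse_tool_calls` and `strip_markup`, the regex, and the returned dict shape are assumptions made for this example; they are not the contents of tool_call_parser.py, which also handles unclosed calls and `<function=...>` XML.

```python
import json
import re

# Hypothetical pattern: a closed <tool_call>...</tool_call> block wrapping one JSON object.
_TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def parse_tool_calls(text: str) -> list[dict]:
    """Return every well-formed tool call found in model output."""
    calls: list[dict] = []
    for match in _TOOL_CALL_RE.finditer(text):
        try:
            payload = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # bad JSON: skip the block rather than raise
        if not isinstance(payload, dict) or "name" not in payload:
            continue
        calls.append(
            {
                "id": f"call_{len(calls)}",
                "name": payload["name"],
                "arguments": payload.get("arguments", {}),
            }
        )
    return calls


def strip_markup(text: str) -> str:
    """Remove closed tool-call blocks so only the prose reaches the client."""
    return _TOOL_CALL_RE.sub("", text).strip()
```

Keeping the pattern and both helpers in one module is what lets two backends share fixes, which is the point the description makes about delegation.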
Tests:
tests/test_safetensors_tool_loop.py: 22 tests covering parser
shapes (closed/unclosed JSON, function/parameter XML, embedded
</parameter> in code, multiple calls, bad JSON), agentic-loop
control flow (plain answers, single tool then answer, truncated
unclosed call, JSON-string arguments healed to {"query": ...}),
behaviour (duplicate-call short-circuit, image-sentinel survival,
tool error nudge, raised exceptions caught), and control
(cancel_event break, max_tool_iterations cap).
Backwards compatibility:
LlamaCppBackend._parse_tool_calls_from_text keeps the same signature
and behaviour.
All new IPC kwargs are optional and only added to the cmd dict when
set, so older worker payloads are unaffected.
The SSE event protocol matches the existing GGUF tool stream so the
frontend tool UI works unchanged.
for more information, see https://pre-commit.ci
Code Review
This pull request implements an agentic tool loop for the safetensors/transformers backend, aligning its capabilities with the GGUF path. Key changes include the introduction of a backend-neutral tool-call XML parser, a new safetensors-specific agentic loop handler, and updates across the inference pipeline to forward tool and reasoning parameters to the model templates. Feedback focuses on improving the robustness of the image sentinel stripping logic and replacing silent exception handlers with debug logging in the feature detection logic.
        "Please try a different approach or rephrase your request."
    )
The sentinel stripping logic using rsplit("\n__IMAGES__:", 1) is fragile for two reasons:
- If the tool returns multiple images (e.g., Text\n__IMAGES__:1\n__IMAGES__:2), rsplit with maxsplit=1 will only remove the last occurrence, leaving the previous sentinels visible to the model.
- If the sentinel appears at the very beginning of the string without a leading newline, the check and the split will fail to match.
Using a simple split on the sentinel itself and taking the first part is more robust for removing the entire images block.
        "Please try a different approach or rephrase your request."
    )
if isinstance(result_for_model, str) and "__IMAGES__:" in result_for_model:
    result_for_model = result_for_model.split("__IMAGES__:", 1)[0].rstrip()
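The difference between the two approaches can be checked with a small side-by-side sketch. The helper names `strip_images_rsplit` and `strip_images_split` are made up for this comparison; only the `rsplit` / `split` expressions come from the review comment.

```python
SENTINEL = "__IMAGES__:"


def strip_images_rsplit(result: str) -> str:
    """Original approach: cut at the last newline-prefixed sentinel only."""
    if "\n" + SENTINEL in result:
        return result.rsplit("\n" + SENTINEL, 1)[0]
    return result


def strip_images_split(result: str) -> str:
    """Suggested approach: cut at the first sentinel, newline or not."""
    if SENTINEL in result:
        return result.split(SENTINEL, 1)[0].rstrip()
    return result


multi = "Text\n__IMAGES__:1\n__IMAGES__:2"
leading = "__IMAGES__:1"

print(repr(strip_images_rsplit(multi)))    # 'Text\n__IMAGES__:1' (first sentinel leaks)
print(repr(strip_images_split(multi)))     # 'Text'
print(repr(strip_images_rsplit(leading)))  # '__IMAGES__:1' (no match, nothing stripped)
print(repr(strip_images_split(leading)))   # ''
```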
    flags["reasoning_style"] = "reasoning_effort"
    flags["supports_tools"] = False
Avoid using broad, silent exception handlers. Even if the failure is expected for some models, logging it at a debug level with the stack trace helps diagnose issues when feature detection doesn't behave as expected.
    flags["reasoning_style"] = "reasoning_effort"
    flags["supports_tools"] = False
except Exception:
    logger.debug("safetensors_features.gpt_oss_check_failed", exc_info = True)
References
- Avoid using broad, silent exception handlers like except Exception: pass. Instead, log the exception, even if at a debug level, to aid in future debugging.
…e helper
Two follow-ups from the comprehensive simulation pass:
1. Bug: assistant prose containing the literal string "<tool_call>" was silently truncated. The STREAMING end-of-stream branch re-yielded the cumulative content with ``strip_tool_markup(..., final=True)`` whenever the parser found no real tool calls. ``final=True`` removes any trailing unclosed ``<tool_call>.*$`` run, which dropped legitimate prose mentioning the literal text (e.g. "the docs say <tool_call> means an LLM tool"). The streaming pass already emitted the cleaned cumulative content via partial strips, so the final re-yield was redundant and only ever hid real text. Drop it; the DRAINING-no-parse fallback now surfaces the raw content_accum instead of the final-stripped version. Adds regression tests covering both the prose case and the case where the tool RESULT text contains the literal "<tool_call>" (the loop must only parse model output, not tool results).
2. Extract _apply_chat_template_for_generation into core/inference/chat_template_helpers.apply_chat_template_for_generation so its kwarg-fallback chain (richest call first, peel off groups on TypeError, propagate real Jinja errors) can be unit-tested without pulling unsloth / torch / transformers into the sandbox. InferenceBackend's method becomes a thin delegate.
Tests:
TestProseMentioningToolCall: two new tests for the truncation regression and the tool-result-text safety case.
TestChatTemplateHelper: five new tests for the helper's fallback chain across template-kwarg permutations and the Jinja-error propagate behaviour.
All 29 tests in test_safetensors_tool_loop.py pass; the full related suite (202 tests across test_safetensors_tool_loop, test_openai_tool_passthrough, test_responses_tool_passthrough, test_inference_model_validation, test_anthropic_thinking_translation, test_anthropic_code_execution, test_anthropic_messages) is green.
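The kwarg-fallback chain described in item 2 can be sketched roughly as follows. This is an illustration under stated assumptions, not the helper from chat_template_helpers.py: `render` stands in for the tokenizer's template call, and both the name `render_with_fallback` and the peel order in `PEEL_ORDER` are guesses made for the example.

```python
from typing import Any, Callable

# Assumed peel order for illustration; the real helper peels kwargs in groups.
PEEL_ORDER = ("preserve_thinking", "reasoning_effort", "enable_thinking", "tools")


def render_with_fallback(
    render: Callable[..., str], messages: list, **template_kwargs: Any
) -> str:
    """Try the richest call first, then drop one optional kwarg at a time."""
    current = {k: v for k, v in template_kwargs.items() if v is not None}
    attempts = [dict(current)]
    for key in PEEL_ORDER:
        if key in current:
            current = {k: v for k, v in current.items() if k != key}
            attempts.append(dict(current))
    for kwargs in attempts:
        try:
            return render(messages, **kwargs)
        except TypeError:
            continue  # this renderer does not accept the kwarg set; try a leaner one
    # Every attempt raised TypeError: fail loudly instead of retrying a bare call.
    raise RuntimeError("chat template rejected every kwarg combination")
```

Only `TypeError` is treated as "kwarg not supported"; any other exception, such as a template syntax error, propagates to the caller on the first attempt. One trade-off of this shape is that a `TypeError` raised for an unrelated reason inside the renderer would also trigger a retry.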
Enforce route-side enabled_tools list inside the agentic loop so a model emitting a disabled tool name (e.g. terminal/python when only web_search was advertised) gets an explicit "not enabled" tool result instead of running.
Drop the implicit `payload.stream` gate on the safetensors tool branch; behaviour now matches the GGUF server-side tool path and also returns a synchronous ChatCompletion JSON when the OpenAI client did not request streaming.
Honour `max_tool_calls_per_message = 0` as "disabled" by gating the route branch on a non-zero budget and short-circuiting the loop when the iteration count is non-positive.
Detect tool-call signals mid-stream after the buffering window has expired so prose-then-tool model output no longer leaks raw `<tool_call>` markup to the client before the safety-net parses it.
Forward `payload.use_adapter` through `generate_chat_completion_with_tools` and route the per-turn generator through `generate_with_adapter_control` when an adapter is selected, matching the non-tool path.
Decouple `auto_heal_tool_calls` from XML-detection itself so a client that disables healing still has well-formed `<tool_call>` parsed; healing only governs malformed-call recovery and the bare-string argument coercion.
Call `backend.reset_generation_state` from `sf_tool_stream` on cancel / disconnect / CancelledError / Exception, matching the existing non-tool stream path so adapters and VRAM aren't stuck after a cancel.
Replace last-only duplicate-call check with a full-history scan so A->B->A patterns short-circuit alongside strictly consecutive repeats.
Pick the canonical healed-argument key per tool name (`code` for python, `command` for terminal, `query` otherwise) so a Hermes-style bare-string argument routes to the parameter the tool actually consumes.
Replace the unreachable bare `apply_chat_template` call after the fallback loop with an explicit RuntimeError so future readers don't mistake it for a real fallback.
Add an `id_offset` kwarg to `parse_tool_calls_from_text` and bump it across iterations from the agentic loop so tool_call_id values are unique across the conversation rather than restarting at `call_0` each turn.
Add a parent-process `_is_gpt_oss_model` on `InferenceOrchestrator` that, together with the existing in-process backend method, now delegates to a single `is_gpt_oss_model_name` helper alongside MODEL_TO_TEMPLATE_MAPPER; the route-level gpt-oss exclusion (and the same check inside `_detect_safetensors_features`) now actually fires when called from the parent process.
Hoist the duplicate-call / budget-exhausted / tool-error nudge strings and the error-prefix tuple into `tool_call_parser.py` and consume them from both `llama_cpp.py` and `safetensors_agentic.py` so the two backends stay in lockstep when the wording changes.
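Two of the guardrails in this commit, the full-history duplicate scan and the `id_offset` numbering, can be illustrated with a short sketch. The helper names `seen_before` and `number_calls` and the `(name, canonical JSON)` key are assumptions made for this example; the loop's actual bookkeeping may differ.

```python
import json


def seen_before(history: set, name: str, arguments: dict) -> bool:
    """Full-history duplicate check: True if this exact call already ran."""
    key = (name, json.dumps(arguments, sort_keys=True, default=str))
    if key in history:
        return True
    history.add(key)
    return False


def number_calls(calls: list[dict], id_offset: int = 0) -> list[dict]:
    """Assign conversation-unique ids by continuing from id_offset."""
    return [{**call, "id": f"call_{id_offset + i}"} for i, call in enumerate(calls)]


history: set = set()
a = ("web_search", {"query": "weather in SF"})
b = ("python", {"code": "print(1)"})

print(seen_before(history, *a))  # False: first A
print(seen_before(history, *b))  # False: first B
print(seen_before(history, *a))  # True: A -> B -> A is caught, not only A -> A

first_turn = number_calls([{"name": "web_search"}, {"name": "python"}])
second_turn = number_calls([{"name": "terminal"}], id_offset=len(first_turn))
print([c["id"] for c in first_turn + second_turn])  # ['call_0', 'call_1', 'call_2']
```

A last-only check would compare the third call against B and let the repeated A through, which is the gap the full-history scan closes.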
is_gpt_oss_model_name("") (or None) used to match because the
substring scan `"" in key` is True for every mapper entry; with at
least one gpt-oss mapping present the helper returned True for an
empty model name. Guard the empty case explicitly so callers that
pass an unset active_model_name do not get a false positive.
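The false positive follows from Python's substring semantics: the empty string is contained in every string. A minimal reproduction, using a made-up two-entry mapping in place of MODEL_TO_TEMPLATE_MAPPER and hypothetical function names:

```python
from typing import Optional

# Hypothetical stand-in for the real mapper: model key -> template name.
TEMPLATE_MAP = {
    "unsloth/gpt-oss-20b": "gpt-oss",
    "unsloth/qwen3-0.6b": "qwen3",
}


def is_gpt_oss_unguarded(model_name: Optional[str]) -> bool:
    """Substring scan without the empty-name guard (the old behaviour)."""
    name = (model_name or "").lower()
    return any(
        name in key for key, template in TEMPLATE_MAP.items() if template == "gpt-oss"
    )


def is_gpt_oss_guarded(model_name: Optional[str]) -> bool:
    """Same scan, but an empty or None name never matches."""
    name = (model_name or "").lower()
    if not name:
        return False
    return any(
        name in key for key, template in TEMPLATE_MAP.items() if template == "gpt-oss"
    )


print("" in "unsloth/gpt-oss-20b")        # True: the root cause
print(is_gpt_oss_unguarded(""))           # True: false positive
print(is_gpt_oss_guarded(""))             # False
print(is_gpt_oss_guarded(None))           # False
print(is_gpt_oss_guarded("gpt-oss-20b"))  # True
```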
Widen the _make_loop fixture default tools to include python and terminal alongside web_search so existing tool-execution tests keep passing under the new route-side allowlist enforcement.
Add TestGuardrails covering: disabled-tool allowlist enforcement, empty tools list bypass, max_tool_iterations=0 disabled budget, prose-then-tool no markup leak in streaming mode, auto_heal_tool_calls=False still parses valid XML, non-consecutive duplicate short-circuit, _coerce_arguments canonical-key heal for python (code) and terminal (command), and unique tool_call_id values across loop iterations.
Add TestGptOssNameDetection covering substring match for known harmony model, negative for a known non-oss model, and the empty/None guard.
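The canonical-key heal that these tests exercise might look roughly like the sketch below. The function name `coerce_arguments` and its exact fallback rules are assumptions for illustration; only the key choice (`code` for python, `command` for terminal, `query` otherwise) comes from the commit messages.

```python
import json
from typing import Any

# Per-tool parameter that a bare-string argument should be routed to.
CANONICAL_KEY = {"python": "code", "terminal": "command"}


def coerce_arguments(tool_name: str, raw: Any) -> dict:
    """Heal a bare-string argument into the dict shape the tool expects."""
    if isinstance(raw, dict):
        return raw
    if not isinstance(raw, str):
        return {}
    try:
        decoded = json.loads(raw)
    except json.JSONDecodeError:
        decoded = raw  # not JSON at all: treat the text itself as the value
    if isinstance(decoded, dict):
        return decoded  # a JSON-encoded object: already well-formed
    # Keep the original text unless JSON decoding produced a plain string.
    value = decoded if isinstance(decoded, str) else raw
    return {CANONICAL_KEY.get(tool_name, "query"): value}
```

For example, `coerce_arguments("python", "print(1)")` yields `{"code": "print(1)"}`, while the same string sent to an unlisted tool would land under `query`.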
c0dee98 to fcc9b8a
…g push had scrubbed
Auto-review verdict: Changes requested
Reason: Deterministic gate: studio_unit_tests failed (passed=False)
The staging-push mechanism scrubbed and partially restored .github/workflows/ on this branch, which left the PR diff carrying 22 deleted workflow YAML files, 2 modified workflow YAMLs, 2 binary image files from images/, and five unrelated files (studio/backend/main.py, studio/backend/tests/test_middleware.py, studio/backend/utils/models/model_config.py, studio/frontend/src/features/settings/tabs/about-tab.tsx, studio/src-tauri/src/preflight/backend.rs) that came in via the intermediate merges from main. None of these files belong to this PR. Restore each one to its origin/main contents so the merged diff only contains the safetensors agentic loop changes the PR description advertises.
After this commit ``git diff origin/main --name-only`` reports exactly the 11 intended files:
- studio/backend/core/inference/chat_template_helpers.py (new)
- studio/backend/core/inference/inference.py
- studio/backend/core/inference/llama_cpp.py
- studio/backend/core/inference/orchestrator.py
- studio/backend/core/inference/safetensors_agentic.py (new)
- studio/backend/core/inference/tool_call_parser.py (new)
- studio/backend/core/inference/worker.py
- studio/backend/routes/inference.py
- studio/backend/tests/test_safetensors_tool_loop.py (new)
- studio/backend/utils/datasets/__init__.py
- studio/backend/utils/datasets/model_mappings.py
All 41 in-repo safetensors-tool-loop tests still pass, plus 100 related existing tests (openai/responses/anthropic/inference_model_validation), plus 86 sim tests and 6 real-executor sim tests.
The orchestrator's iter-1 refactor of llama_cpp.py inadvertently removed _probe_dns_dead and _hf_offline_if_dns_dead (added on main by #5505 between when this branch forked and the orchestrator's merge), which caused tests/test_offline_gguf_cache_fallback.py to fail collection across Python 3.10 / 3.11 / 3.12 / 3.13:
ImportError: cannot import name '_hf_offline_if_dns_dead' from 'core.inference.llama_cpp'
The original intent of this PR for llama_cpp.py was only to delegate the existing _parse_tool_calls_from_text staticmethod to the shared core/inference/tool_call_parser.py, so this commit:
1. Restores studio/backend/core/inference/llama_cpp.py to origin/main verbatim.
2. Re-adds the single import of parse_tool_calls_from_text from the shared parser module.
3. Re-applies the staticmethod-body swap to call the shared parser.
Net delta vs main is now small (the shared parser pulls the body out; the DNS-offline helpers and every other GGUF feature stay exactly as main has them).
Test pass count after the fix (all on Linux Python 3.11):
* 41 safetensors tool-loop tests
* 44 offline GGUF cache fallback tests (the previously failing file)
* 217 other related tool / inference / anthropic tests
= 302 total
…e/search pills enable
The four capability pills (Web Search, Code Execution, Think, Preserve
Think) all read off LoadResponse.supports_tools and supports_reasoning,
which the route layer derives by running detect_reasoning_flags on the
loaded tokenizer's chat template. For safetensors models that template
lives inside the worker subprocess, and the IPC handshake never sent
it back, so backend.models[name]["chat_template_info"] was {} in the
parent process and every safetensors model surfaced as
supports_tools=False -- pills permanently disabled.
GGUF models worked because the llama-server backend lives in the
parent process and reads the template directly.
Changes:
- worker.py: include the resolved chat_template_info dict (template
string, has_template flag, format_type, template_name,
special_tokens) in the "loaded" IPC reply.
- orchestrator.py: mirror that dict into self.models[name] after a
successful load so route handlers see the same shape the inline
safetensors backend used to expose.
- routes/inference.py: the GGUF already_loaded early-return was the
one GGUF response path that did not emit supports_tools; add it so
reloading an active GGUF model keeps the pills enabled.
- frontend chat-adapter.ts: safetensors auto-load branch only set
supportsTools but not toolsEnabled / codeToolsEnabled, while the
GGUF auto-load branch sets both. Bring safetensors to parity so the
pills default to active when the template accepts tools.
Tests:
- New test_safetensors_capability_advertise.py: 11 tests pinning the
classifier output for a real Qwen3 template, the gpt-oss override,
the None-template fallback, the orchestrator mirror contract, the
worker payload-build snippet, and the route-layer end-to-end lookup.
- Re-ran the 41-case test_safetensors_tool_loop.py plus 182 adjacent
inference / anthropic / openai tests, all green.
After this, unsloth/Qwen3-0.6B (safetensors) advertises the same
capability set as unsloth/Qwen3-0.6B-GGUF.
Tested Qwen3.5-2B locally; working as expected for safetensors thinking, search and code exec.
Three feedback items rolled in:
1. gemini-code-assist (medium): the __IMAGES__ sentinel stripper in
safetensors_agentic.py used `rsplit("\n__IMAGES__:", 1)`, which
leaves the marker visible to the model when the sentinel appears at
the very start of the result (no leading newline) and when multiple
sentinels follow each other back-to-back. Switched to a
`split("__IMAGES__:", 1)[0].rstrip()` cut so the first occurrence
truncates the entire image block. Two new tests pin both edge
cases: leading sentinel and consecutive sentinels.
2. gemini-code-assist (medium): `_detect_safetensors_features` had a
bare `except Exception: pass` around the gpt-oss override probe.
Replaced with `logger.debug(..., exc_info=True)` so unexpected
classifier failures are at least visible in the structured log.
3. CodeQL py/stack-trace-exposure (alerts #95 and #96, CWE-209): the
safetensors tool stream and non-streaming tool completion paths
passed `_friendly_error(e)` into the SSE/JSON error response. The
helper itself never leaks a raw traceback, but with `tb` and `e` in
the same scope CodeQL flags the taint sink. Tightened both
handlers to log the exception server-side (logger.exception) and
emit a constant "An internal error occurred." string over the
wire. The GGUF tool stream handler is left as-is because it talks
to a managed llama-server with a known error surface that
`_friendly_error` already classifies safely.
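The pattern in item 3 (log the detail server-side, send a constant string over the wire) can be sketched as a generator wrapper. The wrapper name `guarded_sse`, the logger name, and the JSON error shape are illustrative assumptions; only the constant message text is taken from the commit message.

```python
import json
import logging
from typing import Iterable, Iterator

logger = logging.getLogger("studio.sf_tool_stream")  # hypothetical logger name
GENERIC_ERROR = "An internal error occurred."


def guarded_sse(events: Iterable[str]) -> Iterator[str]:
    """Relay SSE frames; on failure, emit a constant error frame only."""
    try:
        yield from events
    except Exception:
        # Full traceback stays in the server log, never in the response body.
        logger.exception("safetensors tool stream failed")
        payload = {"error": {"message": GENERIC_ERROR, "type": "server_error"}}
        yield f"data: {json.dumps(payload)}\n\n"
```

Because the emitted string no longer depends on the exception object at all, there is no data flow from the exception to the response for a taint analysis to follow.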
Tests: 43 tool-loop + 11 capability-advertise + 190 adjacent
regression tests all green locally.
Also merged origin/main (47 commits) so the branch ships against the
current main rather than its base SHA from a week ago.
…ng fork
Mirrors upstream 6c92b61 onto the cross-OS staging branch:
1. Robust __IMAGES__ sentinel stripping (leading and consecutive sentinels) in safetensors_agentic.py.
2. Debug-log the gpt-oss override probe failure instead of swallowing.
3. Tighten the safetensors tool-stream and JSON tool-completion exception paths so a constant message goes over the wire and the detail stays in logger.exception (CWE-209 / CodeQL alerts 95/96).
4. Two new tests pinning the leading-sentinel and consecutive-sentinel edge cases.
Trim verbose inline / docstring comments across the PR-touched files to single sentences. No behavioural changes; 244 safetensors + adjacent regression tests still green.
After the chat_template_info IPC fix, _detect_safetensors_features
flipped supports_tools=True for every template the GGUF classifier
marks as tool-capable, including Llama-3 and Mistral. Their templates
match the _TOOL_TEMPLATE_MARKERS, but the models emit tool calls in
<|python_tag|> / [TOOL_CALLS] -- not the <tool_call> / <function= XML
the safetensors agentic loop knows how to parse. Enabling the pill for
those families would surface a toggle the parser silently fails on.
The GGUF path is unaffected: llama-server normalises every native
emission format (Llama, Mistral, Qwen, Hermes, ...) into structured
delta.tool_calls before the route layer sees them.
Fix: gate supports_tools on the actual parseable emission markers
(<tool_call> or <function=) appearing in the template's instruction
text. Verified across 10 model families:
Qwen3-0.6B, Qwen3-4B-Instruct-2507 : tools ON (the fix)
Llama-3.2-3B-Instruct, Llama-3.1-8B-Instruct: tools OFF (unchanged)
mistral-7b-instruct-v0.3, Mistral-Small : tools OFF (unchanged)
gemma-2-2b-it : tools OFF (unchanged)
DeepSeek-R1-Distill-Qwen-7B : tools OFF, reasoning ON
gpt-oss-20b (+BF16) : tools OFF (Harmony override),
reasoning ON
New tests:
test_detect_safetensors_features_llama3_template_suppresses_tools
test_detect_safetensors_features_mistral_template_suppresses_tools
test_detect_safetensors_features_qwen_tool_call_keeps_tools_on
test_detect_safetensors_features_function_xml_format_keeps_tools_on
248 / 248 capability + tool-loop + adjacent regression tests pass.
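The gating rule can be reduced to a substring check on the template text. The sketch below uses toy template fragments rather than real model templates, and the function name `template_has_parseable_tools` is invented for the example; the two marker strings are the ones named in the commit message.

```python
from typing import Optional

# Emission formats the safetensors loop's parser can actually read.
PARSEABLE_MARKERS = ("<tool_call>", "<function=")


def template_has_parseable_tools(chat_template: Optional[str]) -> bool:
    """Advertise tools only when the template shows a parseable call format."""
    if not chat_template:
        return False
    return any(marker in chat_template for marker in PARSEABLE_MARKERS)


# Toy fragments standing in for real chat templates (assumed, not fetched).
qwen_like = "For each call, return JSON within <tool_call></tool_call> tags."
llama_like = "{% if tools %}<|python_tag|>{{ call }}{% endif %}"
mistral_like = "[TOOL_CALLS] {{ tool_calls | tojson }}"

print(template_has_parseable_tools(qwen_like))     # True
print(template_has_parseable_tools(llama_like))    # False
print(template_has_parseable_tools(mistral_like))  # False
print(template_has_parseable_tools(None))          # False
```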
Verified against the live unsloth/Qwen3.5-0.8B and unsloth/Qwen3.5-0.8B-GGUF templates (fetched from HF): both produce identical capability dicts with supports_tools=True (template wraps calls as <tool_call>\n<function=name>...) and supports_reasoning=True (enable_thinking). Add a regression test pinning that contract so the family never silently grays out pills.
…nable
The IPC fix landed earlier in this PR plumbs chat_template_info from
the worker subprocess back to the orchestrator so the route layer can
classify reasoning + tool capabilities from the actual tokenizer
template. That fix only patched the regular transformers path
(InferenceBackend._load_chat_template_info); MLXInferenceBackend never
wrote chat_template_info onto self.models[name] at all, so on Apple
Silicon the route still saw {} and advertised supports_tools=False --
exactly what the user reported when testing unsloth/Qwen3.5-0.8B on
Mac.
Fix: mirror _load_chat_template_info inline in
MLXInferenceBackend._populate_chat_template_info and call it from
load_model after the model is in place. Slim version of the full
helper (no MODEL_TO_TEMPLATE_MAPPER lookup -- unused on the route side
for capability detection -- but format_type + special_tokens for
parity with the transformers path).
After this, MLX loads of Qwen3.5-0.8B (and any other tool-capable
model on Apple Silicon) surface the same LoadResponse the GGUF and
non-MLX safetensors paths already do, and the Web Search / Code
Execution / Think pills enable in the UI.
…rve_thinking
The route layer forwards these four template kwargs into the worker and then to backend.generate_chat_response. The transformers path already accepted them; the MLX path raised "MLXInferenceBackend.generate_chat_response() got an unexpected keyword argument 'tools'" the moment a Mac user toggled any of the pills the IPC fix had just enabled.
Fix: add the four kwargs to generate_chat_response, _generate_text, and _generate_vlm, and route both internal generators through apply_chat_template_for_generation (the same shared helper the transformers path uses) so the kwarg-fallback peels off whatever the template does not accept.
Tests:
test_mlx_generate_chat_response_accepts_template_kwargs -- static signature pin so the regression cannot land again.
test_mlx_generate_text_forwards_kwargs_into_template_helper -- confirms the four kwargs flow through to apply_chat_template_for_generation untouched.
# Conflicts:
#	studio/backend/core/inference/llama_cpp.py
…etensors (unslothai#5520)
Adds tools, thinking blocks, code execution, and web search support to the safetensors / transformers and MLX inference backends in Studio, bringing them to parity with the GGUF path.
What ships
- safetensors / transformers agentic tool loop with cumulative-text state machine, tool-call XML parser, and template kwarg forwarding (tools / enable_thinking / reasoning_effort / preserve_thinking).
- MLX backend: same kwargs accepted on Apple Silicon; chat_template_info shipped through worker IPC; pills enable for Qwen / Qwen3 / Qwen3.5 / Gemma reasoning.
- Capability classifier (_detect_safetensors_features) gates supports_tools on actual parser-compatible emission markers (<tool_call> / <function=) so Llama-3 / Mistral / Gemma 4 do not advertise toggles the parser cannot honour.
- gpt-oss override stays: reasoning on, tools off (Harmony channel, not <tool_call> XML).
- CWE-209 hygiene: safetensors SSE error path emits a constant message and logs the trace server-side.
Validation
- 256 unit tests green (43 tool-loop, 11 capability advertise, 7 MLX backend, 5 main-added, 190 adjacent inference / anthropic / openai regression).
- Cross-OS staging CI green on ubuntu-latest / macos-14 / windows-latest plus a dedicated MLX cartesian probe against real unsloth/Qwen3.5-0.8B on macos-14 (CI 26098107440).
- Capability parity verified across Qwen3 / Qwen3.5 / Llama-3 / Mistral / Gemma / DeepSeek-R1 / gpt-oss (incl. BF16).
- Manual confirmation from Imagineer99 on Qwen3.5-2B: think + search + code exec working.
Closes the safetensors / MLX gap with the GGUF backend.
Summary
Today only the GGUF (llama-server) backend in Studio streams tool calls, thinking blocks, sandboxed Python/Bash execution and web search through the agentic loop. The transformers/safetensors backend is inference-only: no tools, no template-level reasoning controls, no agentic loop. This PR brings safetensors to feature parity for non-vision text chat while leaving the GGUF path untouched.
What changes
Backend:
- core/inference/tool_call_parser.py (new) -- backend-neutral parse_tool_calls_from_text, strip_tool_markup, has_tool_signal and the shared regex set. LlamaCppBackend._parse_tool_calls_from_text delegates here so both backends fix-forward together.
- core/inference/safetensors_agentic.py (new) -- cumulative-text agentic loop with a 3-state buffer (BUFFERING / STREAMING / DRAINING). Emits the same status / content / tool_start / tool_end / metadata events as the GGUF path so the frontend renders both backends identically. Handles duplicate-call short-circuit, __IMAGES__ sentinel stripping before model feedback, error nudge, cancel_event, max_tool_iterations cap and a final-answer attempt.
- core/inference/inference.py -- generate_chat_response now accepts tools / enable_thinking / reasoning_effort / preserve_thinking. New _apply_chat_template_for_generation peels unsupported kwargs off the template call in safe order (richest first) so older chat templates still render. New generate_chat_completion_with_tools wraps the agentic loop.
- core/inference/orchestrator.py -- forwards the new kwargs through both IPC paths (gen and dispatched); new generate_chat_completion_with_tools drives the loop from the parent process so tools run alongside the existing route-layer plumbing.
- core/inference/worker.py -- pulls tools / enable_thinking / reasoning_effort / preserve_thinking from the cmd dict when present and forwards them to backend.generate_chat_response.
- routes/inference.py -- shared _detect_safetensors_features helper that calls the existing detect_reasoning_flags on the loaded tokenizer template so /load, the already_loaded branch and /status all advertise the same flags GGUF does. New safetensors tool-calling SSE branch in POST /chat/completions mirrors the GGUF flow (system prompt nudge, tool subset filtering, stale-XML scrubbing of prior assistant turns). gpt-oss is intentionally gated out of the safetensors tool path because Harmony uses a dedicated channel for tool calls rather than <tool_call> XML; GGUF still serves that case.
Tests:
- tests/test_safetensors_tool_loop.py -- 22 tests covering parser shapes (closed/unclosed JSON, <function=...> XML, embedded </parameter> in code, multiple calls, bad JSON), agentic-loop flow (plain answers, single tool then answer, truncated unclosed call, JSON-string arguments healed to {"query": ...}), behaviour (duplicate-call short-circuit, image-sentinel survival, tool error nudge, raised exceptions caught), and control (cancel_event break, max_tool_iterations cap).
Backwards compatibility
- LlamaCppBackend._parse_tool_calls_from_text keeps the same signature and behaviour.
Test plan
- pytest studio/backend/tests/test_safetensors_tool_loop.py -- 22 new tests pass
- pytest studio/backend/tests/test_openai_tool_passthrough.py studio/backend/tests/test_responses_tool_passthrough.py studio/backend/tests/test_inference_model_validation.py -- 99 existing tool tests still pass
- … <topic> prefix, unclosed-but-balanced JSON, 60-char preview truncation, terminal/python/web_search status formatters)
- … "what is the weather in SF", confirm tool_start / tool_end events arrive and the final answer renders
- … enable_thinking=true, confirm <think>...</think> blocks render in the UI
- … <think> blocks still render via HarmonyTextStreamer