Studio: tools, thinking blocks, code execution and web search for safetensors by danielhanchen · Pull Request #5520 · unslothai/unsloth

danielhanchen · 2026-05-17T13:58:06Z

Summary

Today only the GGUF (llama-server) backend in Studio streams tool calls, thinking blocks, sandboxed Python/Bash execution and web search through the agentic loop. The transformers/safetensors backend is inference-only: no tools, no template-level reasoning controls, no agentic loop. This PR brings safetensors to feature parity for non-vision text chat while leaving the GGUF path untouched.

What changes

Backend:

core/inference/tool_call_parser.py (new) -- backend-neutral parse_tool_calls_from_text, strip_tool_markup, has_tool_signal and the shared regex set. LlamaCppBackend._parse_tool_calls_from_text delegates here so both backends fix-forward together.
core/inference/safetensors_agentic.py (new) -- cumulative-text agentic loop with a 3-state buffer (BUFFERING / STREAMING / DRAINING). Emits the same status / content / tool_start / tool_end / metadata events as the GGUF path so the frontend renders both backends identically. Handles duplicate-call short-circuit, __IMAGES__ sentinel stripping before model feedback, error nudge, cancel_event, max_tool_iterations cap and a final-answer attempt.
core/inference/inference.py -- generate_chat_response now accepts tools / enable_thinking / reasoning_effort / preserve_thinking. New _apply_chat_template_for_generation peels unsupported kwargs off the template call in safe order (richest first) so older chat templates still render. New generate_chat_completion_with_tools wraps the agentic loop.
core/inference/orchestrator.py -- forwards the new kwargs through both IPC paths (gen and dispatched); new generate_chat_completion_with_tools drives the loop from the parent process so tools run alongside the existing route-layer plumbing.
core/inference/worker.py -- pulls tools / enable_thinking / reasoning_effort / preserve_thinking from the cmd dict when present and forwards them to backend.generate_chat_response.
routes/inference.py -- shared _detect_safetensors_features helper that calls the existing detect_reasoning_flags on the loaded tokenizer template so /load, the already_loaded branch and /status all advertise the same flags GGUF does. New safetensors tool-calling SSE branch in POST /chat/completions mirrors the GGUF flow (system prompt nudge, tool subset filtering, stale-XML scrubbing of prior assistant turns). gpt-oss is intentionally gated out of the safetensors tool path because Harmony uses a dedicated channel for tool calls rather than <tool_call> XML; GGUF still serves that case.
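
To make the parser's role concrete, here is a minimal sketch of a backend-neutral parse and strip step. It assumes the closed `<tool_call>{json}</tool_call>` shape only; the function names carry a `_sketch` suffix because they are illustrative stand-ins, not the PR's `parse_tool_calls_from_text` or `strip_tool_markup`, whose regex set is not shown in this page.

```python
import json
import re

# Hypothetical pattern: matches one closed <tool_call>...</tool_call> block.
_TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def parse_tool_calls_sketch(text):
    """Return a list of {"name", "arguments"} dicts found in model output."""
    calls = []
    for match in _TOOL_CALL_RE.finditer(text):
        try:
            payload = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # bad JSON: skip the block instead of raising
        if isinstance(payload, dict) and "name" in payload:
            calls.append(
                {"name": payload["name"], "arguments": payload.get("arguments", {})}
            )
    return calls


def strip_tool_markup_sketch(text):
    """Remove closed tool-call blocks so only prose reaches the client."""
    return _TOOL_CALL_RE.sub("", text).strip()
```

Keeping both helpers in one module is what lets two backends share a fix; the real module also handles unclosed calls and `<function=...>` XML, which this sketch omits.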

Tests:

tests/test_safetensors_tool_loop.py -- 22 tests covering parser shapes (closed/unclosed JSON, <function=...> XML, embedded </parameter> in code, multiple calls, bad JSON), agentic-loop flow (plain answers, single tool then answer, truncated unclosed call, JSON-string arguments healed to {"query": ...}), behaviour (duplicate-call short-circuit, image-sentinel survival, tool error nudge, raised exceptions caught), and control (cancel_event break, max_tool_iterations cap).

Backwards compatibility

LlamaCppBackend._parse_tool_calls_from_text keeps the same signature and behaviour.
All new IPC kwargs are optional and only added to the cmd dict when set, so older worker payloads are unaffected.
The SSE event protocol matches the existing GGUF tool stream so the frontend tool UI works unchanged.
Vision turns and gpt-oss are explicitly gated out of the new tool path; both keep their pre-PR behaviour.

Test plan

pytest studio/backend/tests/test_safetensors_tool_loop.py -- 22 new tests pass
pytest studio/backend/tests/test_openai_tool_passthrough.py studio/backend/tests/test_responses_tool_passthrough.py studio/backend/tests/test_inference_model_validation.py -- 99 existing tool tests still pass
Cross-platform simulation (Unix and Windows path sentinels, IDN URLs, long streams, false-positive <topic> prefix, unclosed-but-balanced JSON, 60-char preview truncation, terminal/python/web_search status formatters)
Route-integration simulation (template classification for Qwen3 / Llama 3.1 / gpt-oss / plain, orchestrator IPC kwarg forwarding, worker kwarg forwarding)
Load a tool-capable safetensors model (Qwen3 / Llama 3.1) in Studio, enable tools, send "what is the weather in SF", confirm tool_start / tool_end events arrive and the final answer renders
Same load with enable_thinking=true, confirm <think>...</think> blocks render in the UI
Vision model + tools toggle on: tool path stays disabled, plain chat path runs
gpt-oss safetensors: tools toggle is hidden, reasoning toggle is shown, Harmony <think> blocks still render via HarmonyTextStreamer

The GGUF/llama-server backend already streams tool_start/tool_end events, strips <tool_call> XML, parses <think> blocks, and runs an agentic loop through web_search / python / terminal. The transformers/safetensors backend was inference-only: no tools, no template-level reasoning controls, no agentic loop. This change brings safetensors to parity for non-vision text chat while leaving the GGUF path untouched.

Backend changes:

core/inference/tool_call_parser.py (new): backend-neutral parse_tool_calls_from_text, strip_tool_markup, has_tool_signal, and shared regex/strip patterns. LlamaCppBackend._parse_tool_calls_from_text delegates here, so both paths fix-forward together.
core/inference/safetensors_agentic.py (new): cumulative-text agentic loop with a 3-state buffer (BUFFERING, STREAMING, DRAINING). Yields the same status / content / tool_start / tool_end / metadata events the GGUF path already emits. Handles duplicate-call short-circuit, __IMAGES__ sentinel stripping before model feedback, error-prefix tagging, cancel_event, and max_tool_iterations capping.
core/inference/inference.py: generate_chat_response now accepts tools / enable_thinking / reasoning_effort / preserve_thinking; _apply_chat_template_for_generation peels unsupported kwargs off the template call in safe order (richest first). New generate_chat_completion_with_tools method wraps the agentic loop.
core/inference/orchestrator.py: forwards the new kwargs through IPC (gen + dispatched paths); adds generate_chat_completion_with_tools that drives the loop from the parent process.
core/inference/worker.py: pulls tools / enable_thinking / reasoning_effort / preserve_thinking from the cmd dict when present and forwards to backend.generate_chat_response.
routes/inference.py: shared _detect_safetensors_features helper that calls detect_reasoning_flags on the loaded tokenizer template so the load/already_loaded/status endpoints all advertise the same flags GGUF does. New safetensors tool-calling SSE branch in POST /chat/completions that mirrors the GGUF flow (system prompt nudge, tool subset filtering, stale-XML scrubbing of prior assistant turns). gpt-oss is gated out of the safetensors tool path because Harmony uses a dedicated channel for tool calls rather than <tool_call> XML; GGUF still serves that case.

Tests:

tests/test_safetensors_tool_loop.py: 22 tests covering parser shapes (closed/unclosed JSON, function/parameter XML, embedded </parameter> in code, multiple calls, bad JSON), agentic-loop control flow (plain answers, single tool then answer, truncated unclosed call, JSON-string arguments healed to {"query": ...}), behaviour (duplicate-call short-circuit, image-sentinel survival, tool error nudge, raised exceptions caught), and control (cancel_event break, max_tool_iterations cap).

Backwards compatibility:

LlamaCppBackend._parse_tool_calls_from_text keeps the same signature and behaviour.
All new IPC kwargs are optional and only added to the cmd dict when set, so older worker payloads are unaffected.
The SSE event protocol matches the existing GGUF tool stream so the frontend tool UI works unchanged.

for more information, see https://pre-commit.ci

gemini-code-assist

Code Review

This pull request implements an agentic tool loop for the safetensors/transformers backend, aligning its capabilities with the GGUF path. Key changes include the introduction of a backend-neutral tool-call XML parser, a new safetensors-specific agentic loop handler, and updates across the inference pipeline to forward tool and reasoning parameters to the model templates. Feedback focuses on improving the robustness of the image sentinel stripping logic and replacing silent exception handlers with debug logging in the feature detection logic.

gemini-code-assist · 2026-05-17T14:00:48Z

+                    "Please try a different approach or rephrase your request."
+                )


The sentinel stripping logic using rsplit("\n__IMAGES__:", 1) is fragile for two reasons:

If the tool returns multiple images (e.g., Text\n__IMAGES__:1\n__IMAGES__:2), rsplit with maxsplit=1 will only remove the last occurrence, leaving the previous sentinels visible to the model.

If the sentinel appears at the very beginning of the string without a leading newline, the check and the split will fail to match.

Using a simple split on the sentinel itself and taking the first part is more robust for removing the entire images block.

Suggested change

                    "Please try a different approach or rephrase your request."
                )
            if isinstance(result_for_model, str) and "__IMAGES__:" in result_for_model:
                result_for_model = result_for_model.split("__IMAGES__:", 1)[0].rstrip()
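
The two failure modes in this comment can be reproduced in isolation. The sketch below is a reconstruction from the review text, not the PR's code: `strip_with_rsplit` approximates the pre-fix behaviour as described, and `strip_with_split` follows the suggested change. Both function names are hypothetical.

```python
def strip_with_rsplit(result):
    # Approximation of the pre-fix logic: cut only at the LAST "\n__IMAGES__:".
    if "\n__IMAGES__:" in result:
        return result.rsplit("\n__IMAGES__:", 1)[0]
    return result


def strip_with_split(result):
    # Suggested logic: cut at the FIRST sentinel, with or without a newline.
    if "__IMAGES__:" in result:
        return result.split("__IMAGES__:", 1)[0].rstrip()
    return result


multi = "Text\n__IMAGES__:1\n__IMAGES__:2"
leading = "__IMAGES__:1"

print(repr(strip_with_rsplit(multi)))    # first sentinel is still present
print(repr(strip_with_split(multi)))     # only "Text" remains
print(repr(strip_with_rsplit(leading)))  # unchanged: no leading newline to match
print(repr(strip_with_split(leading)))   # empty string
```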

gemini-code-assist · 2026-05-17T14:00:49Z

+            flags["reasoning_style"] = "reasoning_effort"
+            flags["supports_tools"] = False


Avoid using broad, silent exception handlers. Even if the failure is expected for some models, logging it at a debug level with the stack trace helps diagnose issues when feature detection doesn't behave as expected.

Suggested change

            flags["reasoning_style"] = "reasoning_effort"
            flags["supports_tools"] = False
        except Exception:
            logger.debug("safetensors_features.gpt_oss_check_failed", exc_info = True)

References

Avoid using broad, silent exception handlers like except Exception: pass. Instead, log the exception, even if at a debug level, to aid in future debugging.
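
As a rough illustration of this guideline, the sketch below wraps an optional probe so that a failure falls back to the default flags but still leaves a debug-level trace. `detect_flags_sketch` and its `is_gpt_oss` callable are hypothetical; only the log event name and the two flag assignments come from the suggestion above.

```python
import logging

logger = logging.getLogger("safetensors_features_sketch")


def detect_flags_sketch(is_gpt_oss):
    """Apply the gpt-oss override if the probe succeeds; never raise."""
    flags = {"supports_tools": True}
    try:
        if is_gpt_oss():
            flags["reasoning_style"] = "reasoning_effort"
            flags["supports_tools"] = False
    except Exception:
        # Expected for some models, but the stack trace stays available
        # to anyone who turns on debug logging.
        logger.debug("safetensors_features.gpt_oss_check_failed", exc_info=True)
    return flags
```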

…e helper

Two follow-ups from the comprehensive simulation pass:

1. Bug: assistant prose containing the literal string "<tool_call>" was silently truncated. The STREAMING end-of-stream branch re-yielded the cumulative content with ``strip_tool_markup(..., final=True)`` whenever the parser found no real tool calls. ``final=True`` removes any trailing unclosed ``<tool_call>.*$`` run, which dropped legitimate prose mentioning the literal text (e.g. "the docs say <tool_call> means an LLM tool"). The streaming pass already emitted the cleaned cumulative content via partial strips, so the final re-yield was redundant and only ever hid real text. Drop it; the DRAINING-no-parse fallback now surfaces the raw content_accum instead of the final-stripped version. Adds regression tests covering both the prose case and the case where the tool RESULT text contains the literal "<tool_call>" (the loop must only parse model output, not tool results).

2. Extract _apply_chat_template_for_generation into core/inference/chat_template_helpers.apply_chat_template_for_generation so its kwarg-fallback chain (richest call first, peel off groups on TypeError, propagate real Jinja errors) can be unit-tested without pulling unsloth / torch / transformers into the sandbox. InferenceBackend's method becomes a thin delegate.

Tests:

TestProseMentioningToolCall: two new tests for the truncation regression and the tool-result-text safety case.
TestChatTemplateHelper: five new tests for the helper's fallback chain across template-kwarg permutations and the Jinja-error propagate behaviour.

All 29 tests in test_safetensors_tool_loop.py pass; the full related suite (202 tests across test_safetensors_tool_loop, test_openai_tool_passthrough, test_responses_tool_passthrough, test_inference_model_validation, test_anthropic_thinking_translation, test_anthropic_code_execution, test_anthropic_messages) is green.
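
The kwarg-fallback chain described in item 2 could look roughly like the sketch below. It assumes a `render` callable standing in for the tokenizer's template call, and the peel order in `groups` is a guess at "richest first"; neither is taken from the PR's helper.

```python
def apply_template_with_fallback(render, messages, **optional):
    """Try the richest call first, dropping kwarg groups on TypeError."""
    # Hypothetical peel order: each tuple names the kwargs kept for one attempt.
    groups = [
        ("tools", "enable_thinking", "reasoning_effort", "preserve_thinking"),
        ("tools", "enable_thinking"),
        ("tools",),
        (),
    ]
    for keep in groups:
        kwargs = {k: optional[k] for k in keep if k in optional}
        try:
            return render(messages, **kwargs)
        except TypeError:
            continue  # signature rejected these kwargs; try a leaner call
    # Reached only if even the bare call raised TypeError.
    raise RuntimeError("no chat template call signature was accepted")
```

Only `TypeError` is caught, so any other exception raised while rendering propagates to the caller unchanged. One trade-off of this shape is that a `TypeError` raised inside the template itself is indistinguishable from a rejected keyword.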

Enforce route-side enabled_tools list inside the agentic loop so a model emitting a disabled tool name (e.g. terminal/python when only web_search was advertised) gets an explicit "not enabled" tool result instead of running.

Drop the implicit `payload.stream` gate on the safetensors tool branch; behaviour now matches the GGUF server-side tool path and also returns a synchronous ChatCompletion JSON when the OpenAI client did not request streaming.

Honour `max_tool_calls_per_message = 0` as "disabled" by gating the route branch on a non-zero budget and short-circuiting the loop when the iteration count is non-positive.

Detect tool-call signals mid-stream after the buffering window has expired so prose-then-tool model output no longer leaks raw `<tool_call>` markup to the client before the safety-net parses it.

Forward `payload.use_adapter` through `generate_chat_completion_with_tools` and route the per-turn generator through `generate_with_adapter_control` when an adapter is selected, matching the non-tool path.

Decouple `auto_heal_tool_calls` from XML-detection itself so a client that disables healing still has well-formed `<tool_call>` parsed; healing only governs malformed-call recovery and the bare-string argument coercion.

Call `backend.reset_generation_state` from `sf_tool_stream` on cancel / disconnect / CancelledError / Exception, matching the existing non-tool stream path so adapters and VRAM aren't stuck after a cancel.

Replace last-only duplicate-call check with a full-history scan so A->B->A patterns short-circuit alongside strictly consecutive repeats.

Pick the canonical healed-argument key per tool name (`code` for python, `command` for terminal, `query` otherwise) so a Hermes-style bare-string argument routes to the parameter the tool actually consumes.

Replace the unreachable bare `apply_chat_template` call after the fallback loop with an explicit RuntimeError so future readers don't mistake it for a real fallback.

Add an `id_offset` kwarg to `parse_tool_calls_from_text` and bump it across iterations from the agentic loop so tool_call_id values are unique across the conversation rather than restarting at `call_0` each turn.

Add a parent-process `_is_gpt_oss_model` on `InferenceOrchestrator` that, together with the existing in-process backend method, now delegates to a single `is_gpt_oss_model_name` helper alongside MODEL_TO_TEMPLATE_MAPPER; the route-level gpt-oss exclusion (and the same check inside `_detect_safetensors_features`) now actually fires when called from the parent process.

Hoist the duplicate-call / budget-exhausted / tool-error nudge strings and the error-prefix tuple into `tool_call_parser.py` and consume them from both `llama_cpp.py` and `safetensors_agentic.py` so the two backends stay in lockstep when the wording changes.
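
Three of the guardrails in this commit (the enabled-tools allowlist, the full-history duplicate scan, and offset tool-call ids) fit in a short sketch. The helper names `call_key`, `decide` and `assign_ids` are hypothetical, and the `call_N` id format is inferred from the `call_0` mention above.

```python
import json


def call_key(name, arguments):
    # Stable identity for a call: name plus key-sorted JSON arguments.
    return name, json.dumps(arguments, sort_keys=True)


def decide(call, enabled_tools, history):
    """Return "not_enabled", "duplicate", or "run" for one parsed tool call."""
    if call["name"] not in enabled_tools:
        return "not_enabled"
    key = call_key(call["name"], call["arguments"])
    if key in history:  # scans every earlier call, so A->B->A is caught
        return "duplicate"
    history.add(key)
    return "run"


def assign_ids(calls, id_offset=0):
    # Continue numbering from earlier iterations instead of restarting at 0.
    return [dict(c, id=f"call_{id_offset + i}") for i, c in enumerate(calls)]
```

A last-only check would compare against the previous call alone and miss the third call in an A->B->A sequence; keeping a set of every key seen is the simplest way to cover both cases.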

is_gpt_oss_model_name("") (or None) used to match because the substring scan `"" in key` is True for every mapper entry; with at least one gpt-oss mapping present the helper returned True for an empty model name. Guard the empty case explicitly so callers that pass an unset active_model_name do not get a false positive.
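
The false positive follows from Python's substring semantics: the empty string is contained in every string. A minimal sketch, using a two-entry stand-in for MODEL_TO_TEMPLATE_MAPPER (the entries and both function names are hypothetical):

```python
# Hypothetical stand-in for the real mapper: lowercased model name -> template.
MAPPER_SKETCH = {
    "unsloth/gpt-oss-20b": "gpt-oss",
    "unsloth/qwen3-0.6b": "qwen3",
}


def is_gpt_oss_unguarded(name):
    # `"" in key` is True for every key, so an empty name always matches.
    return any(
        name in key and template == "gpt-oss"
        for key, template in MAPPER_SKETCH.items()
    )


def is_gpt_oss_guarded(name):
    if not name:  # covers both "" and None
        return False
    return is_gpt_oss_unguarded(name.lower())
```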

Widen the _make_loop fixture default tools to include python and terminal alongside web_search so existing tool-execution tests keep passing under the new route-side allowlist enforcement. Add TestGuardrails covering: disabled-tool allowlist enforcement, empty tools list bypass, max_tool_iterations=0 disabled budget, prose-then-tool no markup leak in streaming mode, auto_heal_tool_calls=False still parses valid XML, non-consecutive duplicate short-circuit, _coerce_arguments canonical-key heal for python (code) and terminal (command), and unique tool_call_id values across loop iterations. Add TestGptOssNameDetection covering substring match for known harmony model, negative for a known non-oss model, and the empty/None guard.
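
For readers unfamiliar with the canonical-key heal these tests pin, a sketch of the idea follows. `coerce_arguments_sketch` is a hypothetical approximation of `_coerce_arguments`, whose actual body is not shown in this page; the key mapping is the one stated in the earlier commit message.

```python
import json

# Per-tool key for a bare-string argument; anything else falls back to "query".
CANONICAL_KEY = {"python": "code", "terminal": "command"}


def coerce_arguments_sketch(tool_name, raw):
    """Turn raw tool arguments into a dict the tool can consume."""
    if isinstance(raw, dict):
        return raw
    if isinstance(raw, str):
        try:
            parsed = json.loads(raw)
            if isinstance(parsed, dict):
                return parsed  # JSON-string arguments decode cleanly
        except json.JSONDecodeError:
            pass
        # Bare string: route it to the parameter the tool actually reads.
        return {CANONICAL_KEY.get(tool_name, "query"): raw}
    return {}
```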

…g push had scrubbed

danielhanchen · 2026-05-18T04:51:08Z

Auto-review verdict: Changes requested

Reason: Deterministic gate: studio_unit_tests failed (passed=False)

The staging-push mechanism scrubbed and partially restored .github/workflows/ on this branch, which left the PR diff carrying 22 deleted workflow YAML files, 2 modified workflow YAMLs, 2 binary image files from images/, and five unrelated files (studio/backend/main.py, studio/backend/tests/test_middleware.py, studio/backend/utils/models/model_config.py, studio/frontend/src/features/settings/tabs/about-tab.tsx, studio/src-tauri/src/preflight/backend.rs) that came in via the intermediate merges from main. None of these files belong to this PR. Restore each one to its origin/main contents so the merged diff only contains the safetensors agentic loop changes the PR description advertises.

After this commit ``git diff origin/main --name-only`` reports exactly the 11 intended files:

- studio/backend/core/inference/chat_template_helpers.py (new)
- studio/backend/core/inference/inference.py
- studio/backend/core/inference/llama_cpp.py
- studio/backend/core/inference/orchestrator.py
- studio/backend/core/inference/safetensors_agentic.py (new)
- studio/backend/core/inference/tool_call_parser.py (new)
- studio/backend/core/inference/worker.py
- studio/backend/routes/inference.py
- studio/backend/tests/test_safetensors_tool_loop.py (new)
- studio/backend/utils/datasets/__init__.py
- studio/backend/utils/datasets/model_mappings.py

All 41 in-repo safetensors-tool-loop tests still pass, plus 100 related existing tests (openai/responses/anthropic/inference_model_validation), plus 86 sim tests and 6 real-executor sim tests.

…commit

The orchestrator's iter-1 refactor of llama_cpp.py inadvertently removed _probe_dns_dead and _hf_offline_if_dns_dead (added on main by #5505 between when this branch forked and the orchestrator's merge), which caused tests/test_offline_gguf_cache_fallback.py to fail collection across Python 3.10 / 3.11 / 3.12 / 3.13:

    ImportError: cannot import name '_hf_offline_if_dns_dead' from 'core.inference.llama_cpp'

The original intent of this PR for llama_cpp.py was only to delegate the existing _parse_tool_calls_from_text staticmethod to the shared core/inference/tool_call_parser.py, so this commit:

1. Restores studio/backend/core/inference/llama_cpp.py to origin/main verbatim.
2. Re-adds the single import of parse_tool_calls_from_text from the shared parser module.
3. Re-applies the staticmethod-body swap to call the shared parser.

Net delta vs main is now small (the shared parser pulls the body out; the DNS-offline helpers and every other GGUF feature stay exactly as main has them).

Test pass count after the fix (all on Linux Python 3.11):

* 41 safetensors tool-loop tests
* 44 offline GGUF cache fallback tests (the previously failing file)
* 217 other related tool / inference / anthropic tests

= 302 total

…e/search pills enable

The four capability pills (Web Search, Code Execution, Think, Preserve Think) all read off LoadResponse.supports_tools and supports_reasoning, which the route layer derives by running detect_reasoning_flags on the loaded tokenizer's chat template. For safetensors models that template lives inside the worker subprocess, and the IPC handshake never sent it back, so backend.models[name]["chat_template_info"] was {} in the parent process and every safetensors model surfaced as supports_tools=False -- pills permanently disabled. GGUF models worked because the llama-server backend lives in the parent process and reads the template directly.

Changes:

- worker.py: include the resolved chat_template_info dict (template string, has_template flag, format_type, template_name, special_tokens) in the "loaded" IPC reply.
- orchestrator.py: mirror that dict into self.models[name] after a successful load so route handlers see the same shape the inline safetensors backend used to expose.
- routes/inference.py: the GGUF already_loaded early-return was the one GGUF response path that did not emit supports_tools; add it so reloading an active GGUF model keeps the pills enabled.
- frontend chat-adapter.ts: safetensors auto-load branch only set supportsTools but not toolsEnabled / codeToolsEnabled, while the GGUF auto-load branch sets both. Bring safetensors to parity so the pills default to active when the template accepts tools.

Tests:

- New test_safetensors_capability_advertise.py: 11 tests pinning the classifier output for a real Qwen3 template, the gpt-oss override, the None-template fallback, the orchestrator mirror contract, the worker payload-build snippet, and the route-layer end-to-end lookup.
- Re-ran the 41-case test_safetensors_tool_loop.py plus 182 adjacent inference / anthropic / openai tests, all green.

After this, unsloth/Qwen3-0.6B (safetensors) advertises the same capability set as unsloth/Qwen3-0.6B-GGUF.
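
The worker-to-orchestrator mirror can be pictured with plain dicts. This is a simplified sketch, not the PR's IPC code: `build_loaded_reply`, `OrchestratorSketch` and the `model_name` field are hypothetical, and only two of the listed chat_template_info keys are included.

```python
def build_loaded_reply(model_name, template):
    """Worker side: attach template info to the "loaded" reply."""
    return {
        "type": "loaded",
        "model_name": model_name,
        "chat_template_info": {
            "template": template,
            "has_template": template is not None,
        },
    }


class OrchestratorSketch:
    """Parent side: keep a copy so route handlers can classify capabilities."""

    def __init__(self):
        self.models = {}

    def on_loaded(self, reply):
        # Older workers may omit the key; fall back to an empty dict.
        info = dict(reply.get("chat_template_info") or {})
        self.models[reply["model_name"]] = {"chat_template_info": info}
```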

Imagineer99 · 2026-05-18T16:04:54Z

Tested Qwen3.5-2B locally; working as expected for safetensors thinking, search and code exec.

Three feedback items rolled in:

1. gemini-code-assist (medium): the __IMAGES__ sentinel stripper in safetensors_agentic.py used `rsplit("\n__IMAGES__:", 1)`, which leaves the marker visible to the model when the sentinel appears at the very start of the result (no leading newline) and when multiple sentinels follow each other back-to-back. Switched to a `split("__IMAGES__:", 1)[0].rstrip()` cut so the first occurrence truncates the entire image block. Two new tests pin both edge cases: leading sentinel and consecutive sentinels.

2. gemini-code-assist (medium): `_detect_safetensors_features` had a bare `except Exception: pass` around the gpt-oss override probe. Replaced with `logger.debug(..., exc_info=True)` so unexpected classifier failures are at least visible in the structured log.

3. CodeQL py/stack-trace-exposure (alerts #95 and #96, CWE-209): the safetensors tool stream and non-streaming tool completion paths passed `_friendly_error(e)` into the SSE/JSON error response. The helper itself never leaks a raw traceback, but with `tb` and `e` in the same scope CodeQL flags the taint sink. Tightened both handlers to log the exception server-side (logger.exception) and emit a constant "An internal error occurred." string over the wire. The GGUF tool stream handler is left as-is because it talks to a managed llama-server with a known error surface that `_friendly_error` already classifies safely.

Tests: 43 tool-loop + 11 capability-advertise + 190 adjacent regression tests all green locally.

Also merged origin/main (47 commits) so the branch ships against the current main rather than its base SHA from a week ago.
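
Item 3's pattern, logging the detail server-side and sending a constant string to the client, might be sketched as follows. `sse_events_sketch` and its event payload shape are assumptions for illustration; only the constant message text comes from the commit message.

```python
import json
import logging

logger = logging.getLogger("sf_tool_stream_sketch")

GENERIC_ERROR = "An internal error occurred."


def sse_events_sketch(generate):
    """Yield SSE lines; on failure emit a constant message, never the detail."""
    try:
        for chunk in generate():
            yield "data: " + json.dumps({"content": chunk}) + "\n\n"
    except Exception:
        # Full traceback goes to the server log only.
        logger.exception("safetensors tool stream failed")
        yield "data: " + json.dumps({"error": {"message": GENERIC_ERROR}}) + "\n\n"
```

Because the exception object never reaches the serialised payload, nothing derived from it (paths, model names, stack frames) can appear in the response.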

…ng fork

Mirrors upstream 6c92b61 onto the cross-OS staging branch:

1. Robust __IMAGES__ sentinel stripping (leading and consecutive sentinels) in safetensors_agentic.py.
2. Debug-log the gpt-oss override probe failure instead of swallowing.
3. Tighten the safetensors tool-stream and JSON tool-completion exception paths so a constant message goes over the wire and the detail stays in logger.exception (CWE-209 / CodeQL alerts 95/96).
4. Two new tests pinning the leading-sentinel and consecutive-sentinel edge cases.

Trim verbose inline / docstring comments across the PR-touched files to single sentences. No behavioural changes; 244 safetensors + adjacent regression tests still green.

After the chat_template_info IPC fix, _detect_safetensors_features flipped supports_tools=True for every template the GGUF classifier marks as tool-capable, including Llama-3 and Mistral. Their templates match the _TOOL_TEMPLATE_MARKERS, but the models emit tool calls in <|python_tag|> / [TOOL_CALLS] -- not the <tool_call> / <function= XML the safetensors agentic loop knows how to parse. Enabling the pill for those families would surface a toggle the parser silently fails on. The GGUF path is unaffected: llama-server normalises every native emission format (Llama, Mistral, Qwen, Hermes, ...) into structured delta.tool_calls before the route layer sees them.

Fix: gate supports_tools on the actual parseable emission markers (<tool_call> or <function=) appearing in the template's instruction text.

Verified across 10 model families:

Qwen3-0.6B, Qwen3-4B-Instruct-2507: tools ON (the fix)
Llama-3.2-3B-Instruct, Llama-3.1-8B-Instruct: tools OFF (unchanged)
mistral-7b-instruct-v0.3, Mistral-Small: tools OFF (unchanged)
gemma-2-2b-it: tools OFF (unchanged)
DeepSeek-R1-Distill-Qwen-7B: tools OFF, reasoning ON
gpt-oss-20b (+BF16): tools OFF (Harmony override), reasoning ON

New tests:

test_detect_safetensors_features_llama3_template_suppresses_tools
test_detect_safetensors_features_mistral_template_suppresses_tools
test_detect_safetensors_features_qwen_tool_call_keeps_tools_on
test_detect_safetensors_features_function_xml_format_keeps_tools_on

248 / 248 capability + tool-loop + adjacent regression tests pass.
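
The gate can be read as a second condition layered on the existing classifier. A minimal sketch, assuming the classifier verdict is passed in as a boolean; `supports_tools_sketch` and `PARSEABLE_MARKERS` are hypothetical names, while the two marker strings come from the fix description.

```python
# Emission markers the safetensors loop's parser is stated to understand.
PARSEABLE_MARKERS = ("<tool_call>", "<function=")


def supports_tools_sketch(template, classifier_says_tools):
    """Advertise tools only when the template shows a parseable call format."""
    if not template or not classifier_says_tools:
        return False
    return any(marker in template for marker in PARSEABLE_MARKERS)
```

Under this rule a template that mentions tools but instructs the model to emit `<|python_tag|>` or `[TOOL_CALLS]` keeps the toggle off, which matches the Llama-3 and Mistral rows above.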

Verified against the live unsloth/Qwen3.5-0.8B and unsloth/Qwen3.5-0.8B-GGUF templates (fetched from HF): both produce identical capability dicts with supports_tools=True (template wraps calls as <tool_call>\n<function=name>...) and supports_reasoning=True (enable_thinking). Add a regression test pinning that contract so the family never silently grays out pills.

…nable

The IPC fix that landed earlier in this PR plumbs chat_template_info from the worker subprocess back to the orchestrator so the route layer can classify reasoning + tool capabilities from the actual tokenizer template. That fix only patched the regular transformers path (InferenceBackend._load_chat_template_info); MLXInferenceBackend never wrote chat_template_info onto self.models[name] at all, so on Apple Silicon the route still saw {} and advertised supports_tools=False -- exactly what the user reported when testing unsloth/Qwen3.5-0.8B on Mac.

Fix: mirror _load_chat_template_info inline in MLXInferenceBackend._populate_chat_template_info and call it from load_model after the model is in place. This is a slim version of the full helper (no MODEL_TO_TEMPLATE_MAPPER lookup -- unused on the route side for capability detection -- but format_type + special_tokens for parity with the transformers path).

After this, MLX loads of Qwen3.5-0.8B (and any other tool-capable model on Apple Silicon) surface the same LoadResponse the GGUF and non-MLX safetensors paths already do, and the Web Search / Code Execution / Think pills enable in the UI.

…rve_thinking

The route layer forwards these four template kwargs into the worker and then to backend.generate_chat_response. The transformers path already accepted them; the MLX path raised "MLXInferenceBackend.generate_chat_response() got an unexpected keyword argument 'tools'" the moment a Mac user toggled any of the pills the IPC fix had just enabled.

Fix: add the four kwargs to generate_chat_response, _generate_text, and _generate_vlm, and route both internal generators through apply_chat_template_for_generation (the same shared helper the transformers path uses) so the kwarg-fallback peels off whatever the template does not accept.

Tests:

test_mlx_generate_chat_response_accepts_template_kwargs -- static signature pin so the regression cannot land again.
test_mlx_generate_text_forwards_kwargs_into_template_helper -- confirms the four kwargs flow through to apply_chat_template_for_generation untouched.
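
A static signature pin of the kind described can be built on `inspect.signature`, without loading any model. The sketch below is a generic illustration; `missing_template_kwargs` is a hypothetical helper, not the PR's test.

```python
import inspect

TEMPLATE_KWARGS = ("tools", "enable_thinking", "reasoning_effort", "preserve_thinking")


def missing_template_kwargs(func):
    """List the template kwargs that `func` would reject as unexpected."""
    params = inspect.signature(func).parameters
    # A **kwargs catch-all accepts any keyword, so nothing is missing.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return []
    return [name for name in TEMPLATE_KWARGS if name not in params]
```

A test would then assert that the list is empty for the backend method under test, so removing one of the four parameters fails at collection time rather than when a user toggles a pill.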

# Conflicts: # studio/backend/core/inference/llama_cpp.py

…etensors (unslothai#5520)

Adds tools, thinking blocks, code execution, and web search support to the safetensors / transformers and MLX inference backends in Studio, bringing them to parity with the GGUF path.

What ships

- safetensors / transformers agentic tool loop with cumulative-text state machine, tool-call XML parser, and template kwarg forwarding (tools / enable_thinking / reasoning_effort / preserve_thinking).
- MLX backend: same kwargs accepted on Apple Silicon; chat_template_info shipped through worker IPC; pills enable for Qwen / Qwen3 / Qwen3.5 / Gemma reasoning.
- Capability classifier (_detect_safetensors_features) gates supports_tools on actual parser-compatible emission markers (<tool_call> / <function=) so Llama-3 / Mistral / Gemma 4 do not advertise toggles the parser cannot honour.
- gpt-oss override stays: reasoning on, tools off (Harmony channel, not <tool_call> XML).
- CWE-209 hygiene: safetensors SSE error path emits a constant message and logs the trace server-side.

Validation

- 256 unit tests green (43 tool-loop, 11 capability advertise, 7 MLX backend, 5 main-added, 190 adjacent inference / anthropic / openai regression).
- Cross-OS staging CI green on ubuntu-latest / macos-14 / windows-latest plus a dedicated MLX cartesian probe against real unsloth/Qwen3.5-0.8B on macos-14 (CI 26098107440).
- Capability parity verified across Qwen3 / Qwen3.5 / Llama-3 / Mistral / Gemma / DeepSeek-R1 / gpt-oss (incl. BF16).
- Manual confirmation from Imagineer99 on Qwen3.5-2B: think + search + code exec working.

Closes the safetensors / MLX gap with the GGUF backend.

danielhanchen requested a review from rolandtannous as a code owner May 17, 2026 13:58

[pre-commit.ci] auto fixes from pre-commit.com hooks

c55b6b3

gemini-code-assist Bot reviewed May 17, 2026

github-advanced-security AI found potential problems May 17, 2026

Comment thread studio/backend/routes/inference.py Fixed

danielhanchen and others added 3 commits May 18, 2026 02:47

[pre-commit.ci] auto fixes from pre-commit.com hooks

64e2c18

Scrub .github/workflows for staging push (matches staging base)

f79a265

This was referenced May 18, 2026

Studio: tools, thinking blocks, code execution and web search for safetensors unslothai/unsloth-staging-1#84

Open

[tests] Studio: tools, thinking blocks, code execution and web search for safetensors danielhanchen/unsloth-staging-2#122

Closed

danielhanchen added the auto-reviewing Auto-review in progress label May 18, 2026

danielhanchen added 6 commits May 18, 2026 03:51

Merge origin/main into head

42ef0d3

Scrub leaked references from comments and string literals

fcc9b8a

Scrub leaked references from comments and string literals

46c1b2a

danielhanchen force-pushed the studio-safetensors-tools branch from c0dee98 to fcc9b8a Compare May 18, 2026 04:46

pre-commit-ci Bot and others added 4 commits May 18, 2026 04:46

[pre-commit.ci] auto fixes from pre-commit.com hooks

9491c84

Restore main-pickup workflows and offline-gguf-cache test that stagin…

5d6fcc7

…g push had scrubbed

Merge tests branch into head

219a223

Sync .github/workflows with upstream author branch

8746297

danielhanchen added auto-review-failed Auto-review rejected the PR and removed auto-reviewing Auto-review in progress labels May 18, 2026

pre-commit-ci Bot and others added 4 commits May 18, 2026 04:51

[pre-commit.ci] auto fixes from pre-commit.com hooks

ab36225

Drop temp/staging_fixes scratch files accidentally added in previous …

c5b987c

…commit

Merge remote-tracking branch 'origin/main' into studio-safetensors-tools

50c209a

github-advanced-security AI found potential problems May 18, 2026

Comment thread studio/backend/routes/inference.py Fixed

danielhanchen mentioned this pull request May 18, 2026

[staging] Safetensors tool loop CI smoke (ubuntu/macos/windows) danielhanchen/unsloth-staging-2#126

Open

danielhanchen and others added 2 commits May 18, 2026 12:27

[pre-commit.ci] auto fixes from pre-commit.com hooks

7c4566e

danielhanchen and others added 3 commits May 19, 2026 06:08

Merge remote-tracking branch 'origin/main' into studio-safetensors-tools

3793f03

[pre-commit.ci] auto fixes from pre-commit.com hooks

24266c5

danielhanchen and others added 7 commits May 19, 2026 06:38

Studio safetensors: tighten comments

3f03ebb

Merge remote-tracking branch 'origin/main' into studio-safetensors-tools

b3ac068

[pre-commit.ci] auto fixes from pre-commit.com hooks

124c981

danielhanchen merged commit bb4eb88 into main May 19, 2026
7 of 34 checks passed

danielhanchen deleted the studio-safetensors-tools branch May 19, 2026 13:30

danielhanchen merged 32 commits into main from studio-safetensors-tools

3 participants