
Studio: tools, thinking blocks, code execution and web search for safetensors#5520

Merged
danielhanchen merged 32 commits into main from studio-safetensors-tools
May 19, 2026
Conversation

@danielhanchen

Member

Summary

Today only the GGUF (llama-server) backend in Studio streams tool calls, thinking blocks, sandboxed Python/Bash execution and web search through the agentic loop. The transformers/safetensors backend is inference-only: no tools, no template-level reasoning controls, no agentic loop. This PR brings safetensors to feature parity for non-vision text chat while leaving the GGUF path untouched.

What changes

Backend:

  • core/inference/tool_call_parser.py (new) -- backend-neutral parse_tool_calls_from_text, strip_tool_markup, has_tool_signal and the shared regex set. LlamaCppBackend._parse_tool_calls_from_text delegates here so both backends fix-forward together.
  • core/inference/safetensors_agentic.py (new) -- cumulative-text agentic loop with a 3-state buffer (BUFFERING / STREAMING / DRAINING). Emits the same status / content / tool_start / tool_end / metadata events as the GGUF path so the frontend renders both backends identically. Handles duplicate-call short-circuit, __IMAGES__ sentinel stripping before model feedback, error nudge, cancel_event, max_tool_iterations cap and a final-answer attempt.
  • core/inference/inference.py -- generate_chat_response now accepts tools / enable_thinking / reasoning_effort / preserve_thinking. New _apply_chat_template_for_generation peels unsupported kwargs off the template call in safe order (richest first) so older chat templates still render. New generate_chat_completion_with_tools wraps the agentic loop.
  • core/inference/orchestrator.py -- forwards the new kwargs through both IPC paths (gen and dispatched); new generate_chat_completion_with_tools drives the loop from the parent process so tools run alongside the existing route-layer plumbing.
  • core/inference/worker.py -- pulls tools / enable_thinking / reasoning_effort / preserve_thinking from the cmd dict when present and forwards them to backend.generate_chat_response.
  • routes/inference.py -- shared _detect_safetensors_features helper that calls the existing detect_reasoning_flags on the loaded tokenizer template so /load, the already_loaded branch and /status all advertise the same flags GGUF does. New safetensors tool-calling SSE branch in POST /chat/completions mirrors the GGUF flow (system prompt nudge, tool subset filtering, stale-XML scrubbing of prior assistant turns). gpt-oss is intentionally gated out of the safetensors tool path because Harmony uses a dedicated channel for tool calls rather than <tool_call> XML; GGUF still serves that case.
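The 3-state buffer named above can be pictured with a small sketch. Only the state names, the cumulative-text input and the `<tool_call>` / `<function=` markers come from this PR; the function names, the `buffer_chars` window and the event tuples are hypothetical, and the real loop in safetensors_agentic.py may make these transitions differently.

```python
from enum import Enum, auto
from typing import Iterable, Iterator, Tuple

TOOL_SIGNALS = ("<tool_call>", "<function=")  # markers named in this PR


class State(Enum):
    BUFFERING = auto()  # too little text to decide anything yet
    STREAMING = auto()  # plain prose, forwarded to the client
    DRAINING = auto()   # tool markup seen, swallowed until the turn ends


def _held_tail(text: str) -> int:
    """Length of the longest suffix of text that could start a tool signal."""
    hold = 0
    for signal in TOOL_SIGNALS:
        for k in range(min(len(signal) - 1, len(text)), 0, -1):
            if text.endswith(signal[:k]):
                hold = max(hold, k)
                break
    return hold


def route_stream(
    cumulative: Iterable[str], buffer_chars: int = 16
) -> Iterator[Tuple[str, str]]:
    """Consume cumulative snapshots of one model turn and yield events.

    Hypothetical event shapes: ("content", new_text) for prose and one
    final ("tool_text", markup) when a tool signal was seen.
    """
    state, emitted, text, tool_start = State.BUFFERING, 0, "", -1
    for text in cumulative:
        if state is State.DRAINING:
            continue  # keep accumulating silently
        hits = [i for i in (text.find(s) for s in TOOL_SIGNALS) if i >= 0]
        if hits:
            tool_start = min(hits)
            if tool_start > emitted:
                # Flush prose that precedes the markup, never the markup.
                yield ("content", text[emitted:tool_start])
            emitted, state = tool_start, State.DRAINING
            continue
        if state is State.BUFFERING and len(text) >= buffer_chars:
            state = State.STREAMING
        if state is State.STREAMING:
            # Hold back a tail such as "<tool" that may become a signal.
            safe_end = len(text) - _held_tail(text)
            if safe_end > emitted:
                yield ("content", text[emitted:safe_end])
                emitted = safe_end
    if state is State.DRAINING:
        yield ("tool_text", text[tool_start:])
    elif emitted < len(text):
        yield ("content", text[emitted:])
```

Holding back a partial marker is what keeps prose-then-tool output from leaking raw markup: the client sees the prose, and the markup goes to the parser only after the turn ends.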

Tests:

  • tests/test_safetensors_tool_loop.py -- 22 tests covering parser shapes (closed/unclosed JSON, <function=...> XML, embedded </parameter> in code, multiple calls, bad JSON), agentic-loop flow (plain answers, single tool then answer, truncated unclosed call, JSON-string arguments healed to {"query": ...}), behaviour (duplicate-call short-circuit, image-sentinel survival, tool error nudge, raised exceptions caught), and control (cancel_event break, max_tool_iterations cap).
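To make the JSON parser shapes listed above concrete, here is a minimal sketch of a `<tool_call>` parser that accepts closed and unclosed calls and skips bad JSON. It is an illustration only: `parse_tool_calls`, the regex and the returned dict shape are assumptions, not the actual parse_tool_calls_from_text in tool_call_parser.py, and the `<function=...>` XML form is not covered.

```python
import json
import re
from typing import Any, Dict, List

# A call ends at </tool_call> or, for a truncated generation, at end of text.
_TOOL_CALL_RE = re.compile(
    r"<tool_call>\s*(.*?)\s*(?:</tool_call>|$)", re.DOTALL
)


def parse_tool_calls(text: str) -> List[Dict[str, Any]]:
    """Return every well-formed JSON tool call found in model output."""
    calls: List[Dict[str, Any]] = []
    for match in _TOOL_CALL_RE.finditer(text):
        try:
            payload = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # bad JSON: skip this call rather than raise
        if isinstance(payload, dict) and isinstance(payload.get("name"), str):
            calls.append(
                {
                    "name": payload["name"],
                    "arguments": payload.get("arguments", {}),
                }
            )
    return calls
```

For example, a closed call followed by a truncated one yields two entries, while `<tool_call>{not json}</tool_call>` yields none.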

Backwards compatibility

  • LlamaCppBackend._parse_tool_calls_from_text keeps the same signature and behaviour.
  • All new IPC kwargs are optional and only added to the cmd dict when set, so older worker payloads are unaffected.
  • The SSE event protocol matches the existing GGUF tool stream so the frontend tool UI works unchanged.
  • Vision turns and gpt-oss are explicitly gated out of the new tool path; both keep their pre-PR behaviour.
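The "only added to the cmd dict when set" rule might look like the sketch below. The four optional kwarg names come from this PR; `build_generate_cmd` and the `"type"` / `"messages"` keys are hypothetical stand-ins for whatever the orchestrator actually sends.

```python
from typing import Any, Dict, List, Optional


def build_generate_cmd(
    messages: List[Dict[str, Any]],
    tools: Optional[List[Dict[str, Any]]] = None,
    enable_thinking: Optional[bool] = None,
    reasoning_effort: Optional[str] = None,
    preserve_thinking: Optional[bool] = None,
) -> Dict[str, Any]:
    """Build an IPC command, omitting every optional kwarg left unset."""
    cmd: Dict[str, Any] = {"type": "generate", "messages": messages}
    optional = {
        "tools": tools,
        "enable_thinking": enable_thinking,
        "reasoning_effort": reasoning_effort,
        "preserve_thinking": preserve_thinking,
    }
    # `is not None` keeps an explicit False (e.g. enable_thinking=False).
    cmd.update({k: v for k, v in optional.items() if v is not None})
    return cmd
```

A caller that passes none of the new kwargs produces the same payload shape as before the PR, which is what keeps older worker payloads unaffected.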

Test plan

  • pytest studio/backend/tests/test_safetensors_tool_loop.py -- 22 new tests pass
  • pytest studio/backend/tests/test_openai_tool_passthrough.py studio/backend/tests/test_responses_tool_passthrough.py studio/backend/tests/test_inference_model_validation.py -- 99 existing tool tests still pass
  • Cross-platform simulation (Unix and Windows path sentinels, IDN URLs, long streams, false-positive <topic> prefix, unclosed-but-balanced JSON, 60-char preview truncation, terminal/python/web_search status formatters)
  • Route-integration simulation (template classification for Qwen3 / Llama 3.1 / gpt-oss / plain, orchestrator IPC kwarg forwarding, worker kwarg forwarding)
  • Load a tool-capable safetensors model (Qwen3 / Llama 3.1) in Studio, enable tools, send "what is the weather in SF", confirm tool_start / tool_end events arrive and the final answer renders
  • Same load with enable_thinking=true, confirm <think>...</think> blocks render in the UI
  • Vision model + tools toggle on: tool path stays disabled, plain chat path runs
  • gpt-oss safetensors: tools toggle is hidden, reasoning toggle is shown, Harmony <think> blocks still render via HarmonyTextStreamer

The GGUF/llama-server backend already streams tool_start/tool_end events,
strips <tool_call> XML, parses <think> blocks, and runs an agentic loop
through web_search / python / terminal. The transformers/safetensors
backend was inference-only: no tools, no template-level reasoning
controls, no agentic loop. This change brings safetensors to parity for
non-vision text chat while leaving the GGUF path untouched.

Backend changes:

core/inference/tool_call_parser.py (new): backend-neutral
parse_tool_calls_from_text, strip_tool_markup, has_tool_signal, and
shared regex/strip patterns. LlamaCppBackend._parse_tool_calls_from_text
delegates here, so both paths fix-forward together.
core/inference/safetensors_agentic.py (new): cumulative-text agentic
loop with a 3-state buffer (BUFFERING, STREAMING, DRAINING). Yields
the same status / content / tool_start / tool_end / metadata events
the GGUF path already emits. Handles duplicate-call short-circuit,
__IMAGES__ sentinel stripping before model feedback, error-prefix
tagging, cancel_event, and max_tool_iterations capping.
core/inference/inference.py: generate_chat_response now accepts
tools / enable_thinking / reasoning_effort / preserve_thinking;
_apply_chat_template_for_generation peels unsupported kwargs off the
template call in safe order (richest first). New
generate_chat_completion_with_tools method wraps the agentic loop.
core/inference/orchestrator.py: forwards the new kwargs through IPC
(gen + dispatched paths); adds generate_chat_completion_with_tools
that drives the loop from the parent process.
core/inference/worker.py: pulls tools/enable_thinking/reasoning_effort/
preserve_thinking from the cmd dict when present and forwards to
backend.generate_chat_response.
routes/inference.py: shared _detect_safetensors_features helper that
calls detect_reasoning_flags on the loaded tokenizer template so the
load/already_loaded/status endpoints all advertise the same flags
GGUF does. New safetensors tool-calling SSE branch in
POST /chat/completions that mirrors the GGUF flow (system prompt
nudge, tool subset filtering, stale-XML scrubbing of prior
assistant turns). gpt-oss is gated out of the safetensors tool path
because Harmony uses a dedicated channel for tool calls rather than
<tool_call> XML; GGUF still serves that case.

Tests:

tests/test_safetensors_tool_loop.py: 22 tests covering parser
shapes (closed/unclosed JSON, function/parameter XML, embedded
</parameter> in code, multiple calls, bad JSON), agentic-loop
control flow (plain answers, single tool then answer, truncated
unclosed call, JSON-string arguments healed to {"query": ...}),
behaviour (duplicate-call short-circuit, image-sentinel survival,
tool error nudge, raised exceptions caught), and control
(cancel_event break, max_tool_iterations cap).

Backwards compatibility:

LlamaCppBackend._parse_tool_calls_from_text keeps the same signature
and behaviour.
All new IPC kwargs are optional and only added to the cmd dict when
set, so older worker payloads are unaffected.
The SSE event protocol matches the existing GGUF tool stream so the
frontend tool UI works unchanged.

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request implements an agentic tool loop for the safetensors/transformers backend, aligning its capabilities with the GGUF path. Key changes include the introduction of a backend-neutral tool-call XML parser, a new safetensors-specific agentic loop handler, and updates across the inference pipeline to forward tool and reasoning parameters to the model templates. Feedback focuses on improving the robustness of the image sentinel stripping logic and replacing silent exception handlers with debug logging in the feature detection logic.

Comment on lines +343 to +344
"Please try a different approach or rephrase your request."
)

Contributor

medium

The sentinel stripping logic using rsplit("\n__IMAGES__:", 1) is fragile for two reasons:

  1. If the tool returns multiple images (e.g., Text\n__IMAGES__:1\n__IMAGES__:2), rsplit with maxsplit=1 will only remove the last occurrence, leaving the previous sentinels visible to the model.
  2. If the sentinel appears at the very beginning of the string without a leading newline, the check and the split will fail to match.

Using a simple split on the sentinel itself and taking the first part is more robust for removing the entire images block.

Suggested change
"Please try a different approach or rephrase your request."
)
if isinstance(result_for_model, str) and "__IMAGES__:" in result_for_model:
    result_for_model = result_for_model.split("__IMAGES__:", 1)[0].rstrip()
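The difference between the two approaches can be checked directly. The sketch below reconstructs the `rsplit` variant from the description in this comment (the original code is not shown here, so its exact shape is an assumption) and compares it with the suggested `split` variant; both function names are hypothetical.

```python
SENTINEL = "__IMAGES__:"


def strip_with_rsplit(result: str) -> str:
    """Variant described in the review: needs a leading newline, cuts once."""
    if "\n" + SENTINEL in result:
        return result.rsplit("\n" + SENTINEL, 1)[0]
    return result


def strip_with_split(result: str) -> str:
    """Suggested variant: cut at the first sentinel, wherever it appears."""
    if SENTINEL in result:
        return result.split(SENTINEL, 1)[0].rstrip()
    return result


multi = "Text\n__IMAGES__:1\n__IMAGES__:2"
leading = "__IMAGES__:plot.png"

print(repr(strip_with_rsplit(multi)))    # 'Text\n__IMAGES__:1'
print(repr(strip_with_split(multi)))     # 'Text'
print(repr(strip_with_rsplit(leading)))  # '__IMAGES__:plot.png'
print(repr(strip_with_split(leading)))   # ''
```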

Comment on lines +268 to +269
flags["reasoning_style"] = "reasoning_effort"
flags["supports_tools"] = False

Contributor

medium

Avoid using broad, silent exception handlers. Even if the failure is expected for some models, logging it at a debug level with the stack trace helps diagnose issues when feature detection doesn't behave as expected.

Suggested change
flags["reasoning_style"] = "reasoning_effort"
flags["supports_tools"] = False
except Exception:
    logger.debug("safetensors_features.gpt_oss_check_failed", exc_info = True)
References
  1. Avoid using broad, silent exception handlers like except Exception: pass. Instead, log the exception, even if at a debug level, to aid in future debugging.

Comment thread studio/backend/routes/inference.py Fixed
danielhanchen and others added 3 commits May 18, 2026 02:47
…e helper

Two follow-ups from the comprehensive simulation pass:

1. Bug: assistant prose containing the literal string "<tool_call>" was
silently truncated.

The STREAMING end-of-stream branch re-yielded the cumulative content
with ``strip_tool_markup(..., final=True)`` whenever the parser found
no real tool calls. ``final=True`` removes any trailing unclosed
``<tool_call>.*$`` run, which dropped legitimate prose mentioning the
literal text (e.g. "the docs say <tool_call> means an LLM tool"). The
streaming pass already emitted the cleaned cumulative content via
partial strips, so the final re-yield was redundant and only ever
hid real text. Drop it; the DRAINING-no-parse fallback now surfaces
the raw content_accum instead of the final-stripped version.

Adds regression tests covering both the prose case and the case
where the tool RESULT text contains the literal "<tool_call>" (the
loop must only parse model output, not tool results).

2. Extract _apply_chat_template_for_generation into
core/inference/chat_template_helpers.apply_chat_template_for_generation
so its kwarg-fallback chain (richest call first, peel off groups
on TypeError, propagate real Jinja errors) can be unit-tested
without pulling unsloth / torch / transformers into the sandbox.
InferenceBackend's method becomes a thin delegate.
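The kwarg-fallback chain described in item 2 could be sketched as follows. The behaviour (richest call first, peel off on TypeError, propagate other errors, explicit RuntimeError at the end) follows the commit text; the function name, the `render` callable and the exact grouping of kwargs are assumptions, since the real groups in chat_template_helpers.py are not shown here.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

# Hypothetical peel order, richest combination first.
_KWARG_GROUPS: Tuple[Tuple[str, ...], ...] = (
    ("tools", "enable_thinking", "reasoning_effort", "preserve_thinking"),
    ("tools", "enable_thinking"),
    ("tools",),
    (),
)


def apply_template_with_fallback(
    render: Callable[..., str],
    messages: List[Dict[str, Any]],
    **optional: Any,
) -> str:
    """Call render(messages, **kwargs), dropping kwargs it rejects."""
    supplied = {k: v for k, v in optional.items() if v is not None}
    last_error: Optional[TypeError] = None
    for group in _KWARG_GROUPS:
        kwargs = {k: supplied[k] for k in group if k in supplied}
        try:
            return render(messages, **kwargs)
        except TypeError as exc:
            # Unsupported kwarg: retry with a leaner group. Any other
            # exception (e.g. a template error) propagates unchanged.
            last_error = exc
    raise RuntimeError("chat template rejected every kwarg combination") from last_error
```

Taking the renderer as a plain callable is what makes the chain testable without torch or transformers: a stub function with a narrow signature stands in for an older chat template.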

Tests:

TestProseMentioningToolCall: two new tests for the truncation
regression and the tool-result-text safety case.
TestChatTemplateHelper: five new tests for the helper's fallback
chain across template-kwarg permutations and the Jinja-error
propagate behaviour.

All 29 tests in test_safetensors_tool_loop.py pass; the full related
suite (202 tests across test_safetensors_tool_loop, test_openai_tool_
passthrough, test_responses_tool_passthrough, test_inference_model_
validation, test_anthropic_thinking_translation, test_anthropic_code_
execution, test_anthropic_messages) is green.
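The truncation regression from item 1 is easy to reproduce in isolation. The sketch below is an assumed reimplementation: the `final` flag and the `<tool_call>.*$` pattern are taken from the commit text, while the rest of `strip_tool_markup` is hypothetical and simpler than the real helper.

```python
import re

_CLOSED_CALL = re.compile(r"<tool_call>.*?</tool_call>", re.DOTALL)
_UNCLOSED_TAIL = re.compile(r"<tool_call>.*$", re.DOTALL)


def strip_tool_markup(text: str, final: bool = False) -> str:
    """Remove closed calls; with final=True also drop an unclosed tail."""
    text = _CLOSED_CALL.sub("", text)
    if final:
        text = _UNCLOSED_TAIL.sub("", text)
    return text


prose = "the docs say <tool_call> means an LLM tool"
print(repr(strip_tool_markup(prose)))              # unchanged
print(repr(strip_tool_markup(prose, final=True)))  # 'the docs say '
```

With `final=True` everything after the literal marker disappears, which is why re-yielding the final-stripped text could only hide prose that the streaming pass had already emitted correctly.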
- Enforce route-side enabled_tools list inside the agentic loop so a
  model emitting a disabled tool name (e.g. terminal/python when only
  web_search was advertised) gets an explicit "not enabled" tool result
  instead of running.
- Drop the implicit `payload.stream` gate on the safetensors tool
  branch; behaviour now matches the GGUF server-side tool path and
  also returns a synchronous ChatCompletion JSON when the OpenAI
  client did not request streaming.
- Honour `max_tool_calls_per_message = 0` as "disabled" by gating the
  route branch on a non-zero budget and short-circuiting the loop when
  the iteration count is non-positive.
- Detect tool-call signals mid-stream after the buffering window has
  expired so prose-then-tool model output no longer leaks raw
  `<tool_call>` markup to the client before the safety-net parses it.
- Forward `payload.use_adapter` through `generate_chat_completion_with_tools`
  and route the per-turn generator through `generate_with_adapter_control`
  when an adapter is selected, matching the non-tool path.
- Decouple `auto_heal_tool_calls` from XML-detection itself so a
  client that disables healing still has well-formed `<tool_call>`
  parsed; healing only governs malformed-call recovery and the
  bare-string argument coercion.
- Call `backend.reset_generation_state` from `sf_tool_stream` on
  cancel / disconnect / CancelledError / Exception, matching the
  existing non-tool stream path so adapters and VRAM aren't stuck
  after a cancel.
- Replace last-only duplicate-call check with a full-history scan so
  A->B->A patterns short-circuit alongside strictly consecutive
  repeats.
- Pick the canonical healed-argument key per tool name (`code` for
  python, `command` for terminal, `query` otherwise) so a
  Hermes-style bare-string argument routes to the parameter the tool
  actually consumes.
- Replace the unreachable bare `apply_chat_template` call after the
  fallback loop with an explicit RuntimeError so future readers don't
  mistake it for a real fallback.
- Add an `id_offset` kwarg to `parse_tool_calls_from_text` and bump it
  across iterations from the agentic loop so tool_call_id values are
  unique across the conversation rather than restarting at `call_0`
  each turn.
- Add a parent-process `_is_gpt_oss_model` on `InferenceOrchestrator`
  that, together with the existing in-process backend method, now
  delegates to a single `is_gpt_oss_model_name` helper alongside
  MODEL_TO_TEMPLATE_MAPPER; the route-level gpt-oss exclusion (and
  the same check inside `_detect_safetensors_features`) now actually
  fires when called from the parent process.
- Hoist the duplicate-call / budget-exhausted / tool-error nudge
  strings and the error-prefix tuple into `tool_call_parser.py` and
  consume them from both `llama_cpp.py` and `safetensors_agentic.py`
  so the two backends stay in lockstep when the wording changes.
- is_gpt_oss_model_name("") (or None) used to match because the
  substring scan `"" in key` is True for every mapper entry; with at
  least one gpt-oss mapping present the helper returned True for an
  empty model name. Guard the empty case explicitly so callers that
  pass an unset active_model_name do not get a false positive.
- Widen the _make_loop fixture default tools to include python and
  terminal alongside web_search so existing tool-execution tests keep
  passing under the new route-side allowlist enforcement.
- Add TestGuardrails covering: disabled-tool allowlist enforcement,
  empty tools list bypass, max_tool_iterations=0 disabled budget,
  prose-then-tool no markup leak in streaming mode,
  auto_heal_tool_calls=False still parses valid XML, non-consecutive
  duplicate short-circuit, _coerce_arguments canonical-key heal for
  python (code) and terminal (command), and unique tool_call_id
  values across loop iterations.
- Add TestGptOssNameDetection covering substring match for known
  harmony model, negative for a known non-oss model, and the
  empty/None guard.
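Three of the guardrails above are small enough to sketch together: the full-history duplicate scan, the canonical healed-argument key, and the empty-name guard. The key names (`code`, `command`, `query`) and the `"" in key` pitfall come from the commit text; the helper signatures and the example mapper keys are hypothetical, not the real code.

```python
import json
from typing import Any, Dict, Iterable, List, Optional, Tuple

_CANONICAL_ARG = {"python": "code", "terminal": "command"}
CallKey = Tuple[str, str]


def coerce_arguments(tool_name: str, arguments: Any) -> Dict[str, Any]:
    """Heal a bare-string argument onto the key the tool consumes."""
    if isinstance(arguments, dict):
        return arguments
    return {_CANONICAL_ARG.get(tool_name, "query"): str(arguments)}


def call_key(name: str, arguments: Dict[str, Any]) -> CallKey:
    """Order-independent identity of one tool call."""
    return (name, json.dumps(arguments, sort_keys=True))


def is_duplicate_call(history: List[CallKey], key: CallKey) -> bool:
    """Scan the whole history, so A->B->A is caught, not only A->A."""
    return key in history


def is_gpt_oss_model_name(
    model_name: Optional[str], gpt_oss_keys: Iterable[str]
) -> bool:
    """Substring match against known keys, with the empty case guarded."""
    if not model_name:
        return False  # "" is a substring of every key, so reject it early
    needle = model_name.lower()
    return any(needle in key.lower() for key in gpt_oss_keys)


history: List[CallKey] = []
for name, raw in [("web_search", "sf weather"), ("python", "print(1)"),
                  ("web_search", "sf weather")]:
    key = call_key(name, coerce_arguments(name, raw))
    print(name, "duplicate" if is_duplicate_call(history, key) else "run")
    history.append(key)
```

The loop at the bottom prints `run`, `run`, `duplicate`: the third call repeats the first even though a different call sits between them.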
@danielhanchen danielhanchen force-pushed the studio-safetensors-tools branch from c0dee98 to fcc9b8a Compare May 18, 2026 04:46
@danielhanchen danielhanchen added auto-review-failed Auto-review rejected the PR and removed auto-reviewing Auto-review in progress labels May 18, 2026
@danielhanchen

Member Author

Auto-review verdict: Changes requested

Reason: Deterministic gate: studio_unit_tests failed (passed=False)

pre-commit-ci Bot and others added 4 commits May 18, 2026 04:51
The staging-push mechanism scrubbed and partially restored
.github/workflows/ on this branch, which left the PR diff carrying
22 deleted workflow YAML files, 2 modified workflow YAMLs, 2 binary
image files from images/, and five unrelated files
(studio/backend/main.py, studio/backend/tests/test_middleware.py,
studio/backend/utils/models/model_config.py,
studio/frontend/src/features/settings/tabs/about-tab.tsx,
studio/src-tauri/src/preflight/backend.rs) that came in via the
intermediate merges from main.

None of these files belong to this PR. Restore each one to its
origin/main contents so the merged diff only contains the
safetensors agentic loop changes the PR description advertises.

After this commit ``git diff origin/main --name-only`` reports
exactly the 11 intended files:

- studio/backend/core/inference/chat_template_helpers.py (new)
- studio/backend/core/inference/inference.py
- studio/backend/core/inference/llama_cpp.py
- studio/backend/core/inference/orchestrator.py
- studio/backend/core/inference/safetensors_agentic.py (new)
- studio/backend/core/inference/tool_call_parser.py (new)
- studio/backend/core/inference/worker.py
- studio/backend/routes/inference.py
- studio/backend/tests/test_safetensors_tool_loop.py (new)
- studio/backend/utils/datasets/__init__.py
- studio/backend/utils/datasets/model_mappings.py

All 41 in-repo safetensors-tool-loop tests still pass, plus 100
related existing tests (openai/responses/anthropic/inference_model
_validation), plus 86 sim tests and 6 real-executor sim tests.
Comment thread studio/backend/routes/inference.py Fixed
The orchestrator's iter-1 refactor of llama_cpp.py inadvertently
removed _probe_dns_dead and _hf_offline_if_dns_dead (added on main by
#5505 between when this branch forked and the orchestrator's merge),
which caused tests/test_offline_gguf_cache_fallback.py to fail
collection across Python 3.10 / 3.11 / 3.12 / 3.13:

    ImportError: cannot import name '_hf_offline_if_dns_dead' from
    'core.inference.llama_cpp'

The original intent of this PR for llama_cpp.py was only to delegate
the existing _parse_tool_calls_from_text staticmethod to the shared
core/inference/tool_call_parser.py, so this commit:

1. Restores studio/backend/core/inference/llama_cpp.py to
   origin/main verbatim.
2. Re-adds the single import of parse_tool_calls_from_text from the
   shared parser module.
3. Re-applies the staticmethod-body swap to call the shared parser.

Net delta vs main is now small (the shared parser pulls the body
out; the DNS-offline helpers and every other GGUF feature stay
exactly as main has them).

Test pass count after the fix (all on Linux Python 3.11):

* 41 safetensors tool-loop tests
* 44 offline GGUF cache fallback tests (the previously failing file)
* 217 other related tool / inference / anthropic tests = 302 total
danielhanchen and others added 2 commits May 18, 2026 12:27
…e/search pills enable

The four capability pills (Web Search, Code Execution, Think, Preserve
Think) all read off LoadResponse.supports_tools and supports_reasoning,
which the route layer derives by running detect_reasoning_flags on the
loaded tokenizer's chat template. For safetensors models that template
lives inside the worker subprocess, and the IPC handshake never sent
it back, so backend.models[name]["chat_template_info"] was {} in the
parent process and every safetensors model surfaced as
supports_tools=False -- pills permanently disabled.

GGUF models worked because the llama-server backend lives in the
parent process and reads the template directly.

Changes:
- worker.py: include the resolved chat_template_info dict (template
  string, has_template flag, format_type, template_name,
  special_tokens) in the "loaded" IPC reply.
- orchestrator.py: mirror that dict into self.models[name] after a
  successful load so route handlers see the same shape the inline
  safetensors backend used to expose.
- routes/inference.py: the GGUF already_loaded early-return was the
  one GGUF response path that did not emit supports_tools; add it so
  reloading an active GGUF model keeps the pills enabled.
- frontend chat-adapter.ts: safetensors auto-load branch only set
  supportsTools but not toolsEnabled / codeToolsEnabled, while the
  GGUF auto-load branch sets both. Bring safetensors to parity so the
  pills default to active when the template accepts tools.

Tests:
- New test_safetensors_capability_advertise.py: 11 tests pinning the
  classifier output for a real Qwen3 template, the gpt-oss override,
  the None-template fallback, the orchestrator mirror contract, the
  worker payload-build snippet, and the route-layer end-to-end lookup.
- Re-ran the 41-case test_safetensors_tool_loop.py plus 182 adjacent
  inference / anthropic / openai tests, all green.

After this, unsloth/Qwen3-0.6B (safetensors) advertises the same
capability set as unsloth/Qwen3-0.6B-GGUF.
@Imagineer99

Collaborator

Tested Qwen3.5-2B locally; working as expected for safetensors thinking, search and code exec.

danielhanchen and others added 3 commits May 19, 2026 06:08
Three feedback items rolled in:

1. gemini-code-assist (medium): the __IMAGES__ sentinel stripper in
   safetensors_agentic.py used `rsplit("\n__IMAGES__:", 1)`, which
   leaves the marker visible to the model when the sentinel appears at
   the very start of the result (no leading newline) and when multiple
   sentinels follow each other back-to-back. Switched to a
   `split("__IMAGES__:", 1)[0].rstrip()` cut so the first occurrence
   truncates the entire image block. Two new tests pin both edge
   cases: leading sentinel and consecutive sentinels.

2. gemini-code-assist (medium): `_detect_safetensors_features` had a
   bare `except Exception: pass` around the gpt-oss override probe.
   Replaced with `logger.debug(..., exc_info=True)` so unexpected
   classifier failures are at least visible in the structured log.

3. CodeQL py/stack-trace-exposure (alerts #95 and #96, CWE-209): the
   safetensors tool stream and non-streaming tool completion paths
   passed `_friendly_error(e)` into the SSE/JSON error response. The
   helper itself never leaks a raw traceback, but with `tb` and `e` in
   the same scope CodeQL flags the taint sink. Tightened both
   handlers to log the exception server-side (logger.exception) and
   emit a constant "An internal error occurred." string over the
   wire. The GGUF tool stream handler is left as-is because it talks
   to a managed llama-server with a known error surface that
   `_friendly_error` already classifies safely.
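The pattern in item 3 (log the detail server-side, send a constant string over the wire) might look like this in a streaming handler. The constant message is quoted from the commit; `sse_events`, the logger name and the SSE payload shape are assumptions for illustration.

```python
import json
import logging
from typing import Iterable, Iterator

logger = logging.getLogger("studio.inference")  # hypothetical logger name
GENERIC_ERROR = "An internal error occurred."


def sse_events(chunks: Iterable[str]) -> Iterator[str]:
    """Wrap a content stream as SSE lines without leaking exception text."""
    try:
        for chunk in chunks:
            payload = json.dumps({"content": chunk})
            yield f"data: {payload}\n\n"
    except Exception:
        # Full traceback stays in the server log only.
        logger.exception("safetensors tool stream failed")
        payload = json.dumps({"error": {"message": GENERIC_ERROR}})
        yield f"data: {payload}\n\n"
```

Because the response body never reads from the exception object, no exception data reaches the wire, which is the kind of flow the CodeQL alerts were about.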

Tests: 43 tool-loop + 11 capability-advertise + 190 adjacent
regression tests all green locally.

Also merged origin/main (47 commits) so the branch ships against the
current main rather than its base SHA from a week ago.
danielhanchen added a commit to danielhanchen/unsloth-staging-2 that referenced this pull request May 19, 2026
…ng fork

Mirrors upstream 6c92b61 onto the cross-OS staging branch:

1. Robust __IMAGES__ sentinel stripping (leading and consecutive
   sentinels) in safetensors_agentic.py.
2. Debug-log the gpt-oss override probe failure instead of swallowing.
3. Tighten the safetensors tool-stream and JSON tool-completion
   exception paths so a constant message goes over the wire and the
   detail stays in logger.exception (CWE-209 / CodeQL alerts 95/96).
4. Two new tests pinning the leading-sentinel and consecutive-
   sentinel edge cases.
danielhanchen and others added 7 commits May 19, 2026 06:38
Trim verbose inline / docstring comments across the PR-touched files
to single sentences. No behavioural changes; 244 safetensors + adjacent
regression tests still green.
After the chat_template_info IPC fix, _detect_safetensors_features
flipped supports_tools=True for every template the GGUF classifier
marks as tool-capable, including Llama-3 and Mistral. Their templates
match the _TOOL_TEMPLATE_MARKERS, but the models emit tool calls in
<|python_tag|> / [TOOL_CALLS] -- not the <tool_call> / <function= XML
the safetensors agentic loop knows how to parse. Enabling the pill for
those families would surface a toggle the parser silently fails on.

The GGUF path is unaffected: llama-server normalises every native
emission format (Llama, Mistral, Qwen, Hermes, ...) into structured
delta.tool_calls before the route layer sees them.

Fix: gate supports_tools on the actual parseable emission markers
(<tool_call> or <function=) appearing in the template's instruction
text. Verified across 10 model families:

  Qwen3-0.6B, Qwen3-4B-Instruct-2507          : tools ON (the fix)
  Llama-3.2-3B-Instruct, Llama-3.1-8B-Instruct: tools OFF (unchanged)
  mistral-7b-instruct-v0.3, Mistral-Small     : tools OFF (unchanged)
  gemma-2-2b-it                               : tools OFF (unchanged)
  DeepSeek-R1-Distill-Qwen-7B                 : tools OFF, reasoning ON
  gpt-oss-20b (+BF16)                         : tools OFF (Harmony override),
                                                reasoning ON
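The gate described in the fix reduces to a substring check on the template. In the sketch below the two markers are the ones named in the commit; the function name and the abbreviated template strings are stand-ins, not real chat templates, and the actual check may scope the search to the template's instruction text more narrowly.

```python
from typing import Optional

# Emission formats the safetensors agentic loop can parse (from this PR).
PARSEABLE_MARKERS = ("<tool_call>", "<function=")


def template_emits_parseable_tools(chat_template: Optional[str]) -> bool:
    """True only if the template tells the model to emit a parseable format."""
    if not chat_template:
        return False
    return any(marker in chat_template for marker in PARSEABLE_MARKERS)


# Abbreviated, made-up template fragments for illustration only.
qwen_like = "For each call return JSON within <tool_call></tool_call> tags"
llama_like = "{{ '<|python_tag|>' + tool_call.name }}"
mistral_like = "[TOOL_CALLS] [{\"name\": ...}]"

for name, tpl in [("qwen", qwen_like), ("llama", llama_like),
                  ("mistral", mistral_like), ("none", None)]:
    print(name, template_emits_parseable_tools(tpl))
```

Only the first fragment passes, matching the table above: Qwen-style templates keep tools on, while Llama and Mistral styles stay off.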

New tests:
  test_detect_safetensors_features_llama3_template_suppresses_tools
  test_detect_safetensors_features_mistral_template_suppresses_tools
  test_detect_safetensors_features_qwen_tool_call_keeps_tools_on
  test_detect_safetensors_features_function_xml_format_keeps_tools_on

248 / 248 capability + tool-loop + adjacent regression tests pass.
Verified against the live unsloth/Qwen3.5-0.8B and
unsloth/Qwen3.5-0.8B-GGUF templates (fetched from HF): both produce
identical capability dicts with supports_tools=True (template wraps
calls as <tool_call>\n<function=name>...) and supports_reasoning=True
(enable_thinking). Add a regression test pinning that contract so the
family never silently grays out pills.
…nable

The IPC fix that landed earlier in this PR plumbs chat_template_info from
the worker subprocess back to the orchestrator so the route layer can
classify reasoning + tool capabilities from the actual tokenizer
template. That fix only patched the regular transformers path
(InferenceBackend._load_chat_template_info); MLXInferenceBackend never
wrote chat_template_info onto self.models[name] at all, so on Apple
Silicon the route still saw {} and advertised supports_tools=False --
exactly what the user reported when testing unsloth/Qwen3.5-0.8B on
Mac.

Fix: mirror _load_chat_template_info inline in
MLXInferenceBackend._populate_chat_template_info and call it from
load_model after the model is in place. Slim version of the full
helper (no MODEL_TO_TEMPLATE_MAPPER lookup -- unused on the route side
for capability detection -- but format_type + special_tokens for
parity with the transformers path).

After this, MLX loads of Qwen3.5-0.8B (and any other tool-capable
model on Apple Silicon) surface the same LoadResponse the GGUF and
non-MLX safetensors paths already do, and the Web Search / Code
Execution / Think pills enable in the UI.
…rve_thinking

The route layer forwards these four template kwargs into the worker
and then to backend.generate_chat_response. The transformers path
already accepted them; the MLX path raised
"MLXInferenceBackend.generate_chat_response() got an unexpected
keyword argument 'tools'" the moment a Mac user toggled any of the
pills the IPC fix had just enabled.

Fix: add the four kwargs to generate_chat_response, _generate_text,
and _generate_vlm, and route both internal generators through
apply_chat_template_for_generation (the same shared helper the
transformers path uses) so the kwarg-fallback peels off whatever the
template does not accept.

Tests:
  test_mlx_generate_chat_response_accepts_template_kwargs --
  static signature pin so the regression cannot land again.
  test_mlx_generate_text_forwards_kwargs_into_template_helper --
  confirms the four kwargs flow through to apply_chat_template_for_
  generation untouched.
# Conflicts:
#	studio/backend/core/inference/llama_cpp.py
@danielhanchen danielhanchen merged commit bb4eb88 into main May 19, 2026
7 of 34 checks passed
@danielhanchen danielhanchen deleted the studio-safetensors-tools branch May 19, 2026 13:30
rsd-darshan pushed a commit to rsd-darshan/unsloth that referenced this pull request Jun 3, 2026
…etensors (unslothai#5520)

Adds tools, thinking blocks, code execution, and web search support to the safetensors / transformers and MLX inference backends in Studio, bringing them to parity with the GGUF path.

What ships
- safetensors / transformers agentic tool loop with cumulative-text state machine, tool-call XML parser, and template kwarg forwarding (tools / enable_thinking / reasoning_effort / preserve_thinking).
- MLX backend: same kwargs accepted on Apple Silicon; chat_template_info shipped through worker IPC; pills enable for Qwen / Qwen3 / Qwen3.5 / Gemma reasoning.
- Capability classifier (_detect_safetensors_features) gates supports_tools on actual parser-compatible emission markers (<tool_call> / <function=) so Llama-3 / Mistral / Gemma 4 do not advertise toggles the parser cannot honour.
- gpt-oss override stays: reasoning on, tools off (Harmony channel, not <tool_call> XML).
- CWE-209 hygiene: safetensors SSE error path emits a constant message and logs the trace server-side.

Validation
- 256 unit tests green (43 tool-loop, 11 capability advertise, 7 MLX backend, 5 main-added, 190 adjacent inference / anthropic / openai regression).
- Cross-OS staging CI green on ubuntu-latest / macos-14 / windows-latest plus a dedicated MLX cartesian probe against real unsloth/Qwen3.5-0.8B on macos-14 (CI 26098107440).
- Capability parity verified across Qwen3 / Qwen3.5 / Llama-3 / Mistral / Gemma / DeepSeek-R1 / gpt-oss (incl. BF16).
- Manual confirmation from Imagineer99 on Qwen3.5-2B: think + search + code exec working.

Closes the safetensors / MLX gap with the GGUF backend.

Labels

auto-review-failed Auto-review rejected the PR

3 participants