[staging] Safetensors tool loop CI smoke (ubuntu/macos/windows) by danielhanchen · Pull Request #126 · danielhanchen/unsloth-staging-2

danielhanchen · 2026-05-18T10:44:10Z

Throwaway staging PR to verify the upstream PR unslothai#5520 (Studio safetensors agentic tool loop) tests pass on real ubuntu-latest, macos-14, and windows-latest runners using CPU-only torch + transformers. Burns no minutes on the upstream repo and stays well under the 5-concurrent-Windows-runner cap.

What this contains

The 11 files PR-5520 changes (safetensors_agentic.py, tool_call_parser.py, chat_template_helpers.py, inference.py, llama_cpp.py, orchestrator.py, worker.py, routes/inference.py, tests/test_safetensors_tool_loop.py, utils/datasets/init.py, utils/datasets/model_mappings.py).
One new workflow .github/workflows/safetensors-tool-loop-ci.yml with a 3-cell OS matrix. Every other .github/workflows/ entry is removed so the fork's CI runs in minutes per push.

Tests run per matrix cell

tests/test_safetensors_tool_loop.py -- 41 tests for parser shapes, state machine, allowlist guard, IPC kwarg forwarding, template fallback chain, prose containing literal `<tool_call>`, unique tool_call_id across iterations.
Regression guard against test_openai_tool_passthrough, test_responses_tool_passthrough, test_inference_model_validation, test_anthropic_thinking_translation, test_anthropic_code_execution, test_anthropic_messages.

Triggers

push on this branch
pull_request limited via paths filter to studio/backend/core/inference/, studio/backend/routes/inference.py, studio/backend/tests/test_safetensors_tool_loop.py, studio/backend/utils/datasets/, and the workflow itself
workflow_dispatch
concurrency.cancel-in-progress so each new push supersedes the previous run

Once this is green, the upstream PR is safe to merge.

Bring the upstream PR-5520 changes into this staging fork and add one focused CI workflow that runs the test_safetensors_tool_loop.py suite plus the directly-adjacent tool / inference / anthropic regression suites on ubuntu-latest, macos-14, and windows-latest under CPU-only torch + transformers. Scope: - Copy the 11 files PR-5520 actually touches into the staging fork (core/inference/{tool_call_parser,safetensors_agentic, chat_template_helpers,inference,llama_cpp,orchestrator,worker}.py, routes/inference.py, tests/test_safetensors_tool_loop.py, utils/datasets/{__init__,model_mappings}.py). - Drop every other .github/workflows/ file so the staging fork's CI budget stays well under the 5-concurrent-Windows-runner cap. Each push triggers exactly one workflow with three matrix cells. Workflow shape: - name: Safetensors tool loop CI - matrix: ubuntu-latest, macos-14, windows-latest - Python 3.11 across the board - pip install CPU torch + transformers (matches Studio's existing backend-ci.yml dep shape) plus pytest, fastapi, pydantic, structlog, pyjwt, cryptography, python-multipart, etc. - Steps: 1. tests/test_safetensors_tool_loop.py (41 tests covering parser, state machine, allowlist, IPC kwarg forwarding, template fallback) 2. Regression guard across openai_tool_passthrough, responses_tool_passthrough, inference_model_validation, anthropic_thinking_translation, anthropic_code_execution, anthropic_messages (~260 additional tests) - paths filter limits trigger to the files this PR actually changes + the workflow itself, so unrelated commits do not re-run it. - concurrency.cancel-in-progress: true so each new push supersedes the previous run. CUDA spoof note: the safetensors loop never reaches model.generate; the tests stub the single-turn cumulative generator. We install the CPU torch wheel only because the studio/backend import chain (utils.hardware) imports torch at module scope.

gemini-code-assist

Code Review

This pull request implements agentic tool-calling for safetensors models, mirroring the GGUF backend's functionality. It introduces a shared XML tool-call parser, a backend-neutral chat template helper with fallback logic for reasoning and tool parameters, and improves GGUF's offline model resolution and logging. Feedback identifies a discrepancy in zero-iteration handling between backends and suggests using sorted JSON keys for more robust duplicate tool call detection.

gemini-code-assist · 2026-05-18T10:47:01Z

+    if max_tool_iterations <= 0:
+        # why safe: documented 0 = disabled; mirror the GGUF loop which
+        # produces no iterations under the same setting.
+        yield {"type": "status", "text": ""}
+        return


This explicit return for max_tool_iterations <= 0 creates a discrepancy with the GGUF tool loop implementation. In the GGUF path (llama_cpp.py), setting iterations to 0 still performs one final generation turn (the 'final answer' pass). Here, it produces no generation at all. Removing this guard would allow the loop to run once (iteration 0), which correctly performs a single turn with the 'final answer' nudge if max_tool_iterations is 0, ensuring parity across backends.

gemini-code-assist · 2026-05-18T10:47:01Z

+                "arguments": arguments,
+            }
+
+            tc_key = tool_name + str(arguments)


Using str(arguments) for duplicate detection is sensitive to dictionary key ordering. While modern Python preserves insertion order, LLMs may emit the same arguments in a different order across turns. Using json.dumps with sort_keys=True provides a more robust canonical key for duplicate detection.

Suggested change

tc_key = tool_name + str(arguments)

tc_key = f"{tool_name}|{json.dumps(arguments, sort_keys=True)}"

windows-latest defaults `run:` blocks to PowerShell, which rejects the backslash line continuations used in the pip install steps and treats `\` as a literal directory argument. Setting `shell: bash` matrix-wide routes through Git Bash (preinstalled on the runner) so one script shape works across ubuntu, macos, and windows.

…e/search pills enable Mirrors the upstream studio-safetensors-tools 327c42c commit on the staging branch so the cross-OS smoke workflow exercises the fix too. Before this, the Web Search / Code Execution / Think pills stayed permanently disabled for every safetensors model because the orchestrator/worker IPC bridge never marshalled the resolved tokenizer.chat_template back from the subprocess, so the route layer's capability detector saw chat_template=None and advertised supports_tools=False.

…ng fork Mirrors upstream 6c92b61 onto the cross-OS staging branch: 1. Robust __IMAGES__ sentinel stripping (leading and consecutive sentinels) in safetensors_agentic.py. 2. Debug-log the gpt-oss override probe failure instead of swallowing. 3. Tighten the safetensors tool-stream and JSON tool-completion exception paths so a constant message goes over the wire and the detail stays in logger.exception (CWE-209 / CodeQL alerts 95/96). 4. Two new tests pinning the leading-sentinel and consecutive- sentinel edge cases.

…o fix Adds a dedicated macos-14 job to the staging workflow that: - installs mlx / mlx-lm / mlx-vlm + Studio backend deps - loads unsloth/Qwen3.5-0.8B for real via MLXInferenceBackend - asserts the chat_template_info IPC payload now contains the template - asserts _detect_safetensors_features returns supports_tools=True and supports_reasoning=True Mirrors the upstream MLXInferenceBackend._populate_chat_template_info fix (commit b1b1623) so the staging branch exercises identical code.

Mirrors upstream dafad14 onto the staging branch (MLX kwarg fix) and extends the macos-14 MLX job so it now exercises: 1. backend.load_model("unsloth/Qwen3.5-0.8B") via MLXInferenceBackend 2. chat_template_info populated on backend.models[name] (IPC contract) 3. _detect_safetensors_features returns supports_tools=True + supports_reasoning=True (LoadResponse contract) 4. backend.generate_chat_response actually accepts the four template kwargs and yields cumulative text across 6 cartesian cells: baseline / thinking_on / thinking_off / web_search_only / python_only / both_tools+thinking 5. safetensors_agentic.run_safetensors_tool_loop drives the MLX stack end to end (fake executor stands in for the real web_search / python tools that require browser + sandbox) Cell timeout 12-24 new tokens to keep CI fast; the goal is plumbing, not benchmark quality.

The previous run proved the 6 cartesian cells succeed on real macos-14 MLX with the new kwarg signature: baseline / thinking_on / thinking_off / web_search_only / python_only / both_tools+thinking all produced output, enable_thinking measurably flipped the model into "Thinking Process:" mode, tools schema measurably grew the prompt from 134 -> 1217+ chars. The agentic loop step then hung past the 25-min job timeout because mlx-vlm on a CPU-only macos-14 runner is too slow to complete even 24 new tokens reliably. Move the agentic-loop coverage back to the unit tests (43 cells in test_safetensors_tool_loop.py, all green) and keep the macos probe focused on the kwarg + chat_template_info contracts.

The shared tool_call_parser used by safetensors and MLX now recognises five emission families so the agentic loop sees the same call shape llama-server normalises for GGUF: - Qwen / Hermes <tool_call>{json}</tool_call> - Qwen3.5 / Hermes XML <function=name><parameter=k>v - Llama-3 built-in tools <|python_tag|>NAME.call(k="v", ...) - Llama-3 custom tools <|python_tag|>{"name":..., "parameters":...} - Llama-3.2 bare JSON {"name":..., "parameters":...} (no tag) - Mistral v0.3 / Nemo / Small [TOOL_CALLS] [{...}, ...] - Mistral v11+ / Magistral [TOOL_CALLS]name{json} (may chain) - Ministral / Large 3 [TOOL_CALLS]name[ARGS]{json} - Gemma 4 <|tool_call>call:NAME{k:<|"|>v<|"|>}<tool_call|> Parsers mirror the per-family regexes in llama.cpp's chat-parser.cpp (legacy pre-PEG branch), vLLM's tool_parsers/, and SGLang's function_call/ modules. Output is normalised to OpenAI shape: {id, type:"function", function:{name, arguments(json_string)}}. Truncated emissions (unclosed brackets, missing close tags) are tolerated -- the parser walks balanced braces and falls back to per- object healing. Streaming buffer wakes up on five markers (was two) so the safetensors / MLX state machine drains tool calls instead of leaking them as prose: TOOL_XML_SIGNALS now contains <tool_call>, <function=, <|python_tag|>, [TOOL_CALLS], <|tool_call>. CI: extends the staging cross-OS smoke with a multi-format parser probe job that exercises every emission shape on ubuntu / macos-14 / windows; runs the existing macos-14 MLX Qwen3.5-0.8B cartesian probe unchanged. Tests: 26 new unit tests covering each format (parser + streaming buffer + strip_tool_markup), plus 11 bare-JSON edge cases that guard against false positives on plain prose / tool-message echoes. Refs: llama.cpp commit 34df42f7be common/chat-parser.cpp; vLLM main/tool_parsers/{llama,mistral,gemma}.py; sglang main/function_call/{llama3,mistral,gemma}_format.py.

Mirrors the healing-parity follow-up commit from PR unslothai#5620: 1. GGUF BUFFERING tuple now uses the shared 5-format TOOL_XML_SIGNALS so non-Qwen tool emissions wake the state machine (was only 2 markers). 2. GGUF stream cleanup delegates to the shared strip_tool_markup so leaked markup from any family is removed from assistant content. 3. GGUF per-tool canonical heal key (python -> code, terminal -> command, * -> query) when arguments is a bare string. 4. Safetensors / MLX re-prompt on plan-without-action with _MAX_REPROMPTS=3 + extra iteration slots so re-prompts do not eat the tool budget. Also pulls in core/tool_healing.py which staging was missing (the legacy two-format helper module that llama_cpp.py imports the regex constants from).

Mirrors PR unslothai#5624: three more emission-family parsers for the shared tool_call_parser plus CI updates that exercise the new fixtures cross-OS. - DeepSeek R1 / V3 / V3.1: <｜tool▁calls▁begin｜>...<｜tool▁sep｜>... - GLM 4.5 / 4.6 / 4.7: <tool_call>NAME\n<arg_key>K</arg_key>\n <arg_value>V</arg_value>...</tool_call> - Kimi K2 / Moonshot: <|tool_calls_section_begin|>...<|tool_call_ argument_begin|>... Ported from llama.cpp common/chat-parser.cpp lines 801-913, 1040-1052 (MIT), vLLM tool_parsers/ {deepseekv31, glm4_moe, kimi_k2}_tool_parser.py (Apache-2.0), and SGLang function_call/ {deepseekv31, glm4_moe, kimik2}_detector.py (Apache-2.0). CI multi-format probe extended from 9 to 13 fixtures so all four new families run on ubuntu / macos-14 / windows.

github-advanced-security AI found potential problems May 18, 2026

View reviewed changes

Comment thread studio/backend/routes/inference.py Fixed

gemini-code-assist Bot reviewed May 18, 2026

View reviewed changes

danielhanchen added 13 commits May 18, 2026 10:51

Studio safetensors: tighten comments (mirror to staging)

b634225

Studio safetensors: gate supports_tools on parser compatibility (mirror)

cd03898

Studio safetensors: pin Qwen3.5 classifier (mirror)

1cbe154

ci: install backend deps so multi-format parser probe can import

acd696e

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[staging] Safetensors tool loop CI smoke (ubuntu/macos/windows)#126

[staging] Safetensors tool loop CI smoke (ubuntu/macos/windows)#126
danielhanchen wants to merge 14 commits into
mainfrom
safetensors-tool-loop-staging

danielhanchen commented May 18, 2026

Uh oh!

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot May 18, 2026

Uh oh!

gemini-code-assist Bot May 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	tc_key = tool_name + str(arguments)
	tc_key = f"{tool_name}\|{json.dumps(arguments, sort_keys=True)}"

Conversation

danielhanchen commented May 18, 2026

What this contains

Tests run per matrix cell

Triggers

Uh oh!

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants