[staging] Safetensors tool loop CI smoke (ubuntu/macos/windows)#126
[staging] Safetensors tool loop CI smoke (ubuntu/macos/windows)#126danielhanchen wants to merge 14 commits into
Conversation
Bring the upstream PR-5520 changes into this staging fork and add one
focused CI workflow that runs the test_safetensors_tool_loop.py suite
plus the directly-adjacent tool / inference / anthropic regression
suites on ubuntu-latest, macos-14, and windows-latest under CPU-only
torch + transformers.
Scope:
- Copy the 11 files PR-5520 actually touches into the staging fork
(core/inference/{tool_call_parser,safetensors_agentic,
chat_template_helpers,inference,llama_cpp,orchestrator,worker}.py,
routes/inference.py, tests/test_safetensors_tool_loop.py,
utils/datasets/{__init__,model_mappings}.py).
- Drop every other .github/workflows/ file so the staging fork's CI
budget stays well under the 5-concurrent-Windows-runner cap. Each
push triggers exactly one workflow with three matrix cells.
Workflow shape:
- name: Safetensors tool loop CI
- matrix: ubuntu-latest, macos-14, windows-latest
- Python 3.11 across the board
- pip install CPU torch + transformers (matches Studio's existing
backend-ci.yml dep shape) plus pytest, fastapi, pydantic, structlog,
pyjwt, cryptography, python-multipart, etc.
- Steps:
1. tests/test_safetensors_tool_loop.py (41 tests covering parser,
state machine, allowlist, IPC kwarg forwarding, template fallback)
2. Regression guard across openai_tool_passthrough,
responses_tool_passthrough, inference_model_validation,
anthropic_thinking_translation, anthropic_code_execution,
anthropic_messages (~260 additional tests)
- paths filter limits trigger to the files this PR actually changes
+ the workflow itself, so unrelated commits do not re-run it.
- concurrency.cancel-in-progress: true so each new push supersedes
the previous run.
CUDA spoof note: the safetensors loop never reaches model.generate;
the tests stub the single-turn cumulative generator. We install the
CPU torch wheel only because the studio/backend import chain
(utils.hardware) imports torch at module scope.
There was a problem hiding this comment.
Code Review
This pull request implements agentic tool-calling for safetensors models, mirroring the GGUF backend's functionality. It introduces a shared XML tool-call parser, a backend-neutral chat template helper with fallback logic for reasoning and tool parameters, and improves GGUF's offline model resolution and logging. Feedback identifies a discrepancy in zero-iteration handling between backends and suggests using sorted JSON keys for more robust duplicate tool call detection.
| if max_tool_iterations <= 0: | ||
| # why safe: documented 0 = disabled; mirror the GGUF loop which | ||
| # produces no iterations under the same setting. | ||
| yield {"type": "status", "text": ""} | ||
| return |
There was a problem hiding this comment.
This explicit return for max_tool_iterations <= 0 creates a discrepancy with the GGUF tool loop implementation. In the GGUF path (llama_cpp.py), setting iterations to 0 still performs one final generation turn (the 'final answer' pass). Here, it produces no generation at all. Removing this guard would allow the loop to run once (iteration 0), which correctly performs a single turn with the 'final answer' nudge if max_tool_iterations is 0, ensuring parity across backends.
| "arguments": arguments, | ||
| } | ||
|
|
||
| tc_key = tool_name + str(arguments) |
There was a problem hiding this comment.
Using str(arguments) for duplicate detection is sensitive to dictionary key ordering. While modern Python preserves insertion order, LLMs may emit the same arguments in a different order across turns. Using json.dumps with sort_keys=True provides a more robust canonical key for duplicate detection.
| tc_key = tool_name + str(arguments) | |
| tc_key = f"{tool_name}|{json.dumps(arguments, sort_keys=True)}" |
windows-latest defaults `run:` blocks to PowerShell, which rejects the backslash line continuations used in the pip install steps and treats `\` as a literal directory argument. Setting `shell: bash` matrix-wide routes through Git Bash (preinstalled on the runner) so one script shape works across ubuntu, macos, and windows.
…e/search pills enable Mirrors the upstream studio-safetensors-tools 327c42c commit on the staging branch so the cross-OS smoke workflow exercises the fix too. Before this, the Web Search / Code Execution / Think pills stayed permanently disabled for every safetensors model because the orchestrator/worker IPC bridge never marshalled the resolved tokenizer.chat_template back from the subprocess, so the route layer's capability detector saw chat_template=None and advertised supports_tools=False.
…ng fork Mirrors upstream 6c92b61 onto the cross-OS staging branch: 1. Robust __IMAGES__ sentinel stripping (leading and consecutive sentinels) in safetensors_agentic.py. 2. Debug-log the gpt-oss override probe failure instead of swallowing. 3. Tighten the safetensors tool-stream and JSON tool-completion exception paths so a constant message goes over the wire and the detail stays in logger.exception (CWE-209 / CodeQL alerts 95/96). 4. Two new tests pinning the leading-sentinel and consecutive- sentinel edge cases.
…o fix
Adds a dedicated macos-14 job to the staging workflow that:
- installs mlx / mlx-lm / mlx-vlm + Studio backend deps
- loads unsloth/Qwen3.5-0.8B for real via MLXInferenceBackend
- asserts the chat_template_info IPC payload now contains the
template
- asserts _detect_safetensors_features returns supports_tools=True
and supports_reasoning=True
Mirrors the upstream MLXInferenceBackend._populate_chat_template_info
fix (commit b1b1623) so the staging branch exercises identical code.
Mirrors upstream dafad14 onto the staging branch (MLX kwarg fix) and extends the macos-14 MLX job so it now exercises: 1. backend.load_model("unsloth/Qwen3.5-0.8B") via MLXInferenceBackend 2. chat_template_info populated on backend.models[name] (IPC contract) 3. _detect_safetensors_features returns supports_tools=True + supports_reasoning=True (LoadResponse contract) 4. backend.generate_chat_response actually accepts the four template kwargs and yields cumulative text across 6 cartesian cells: baseline / thinking_on / thinking_off / web_search_only / python_only / both_tools+thinking 5. safetensors_agentic.run_safetensors_tool_loop drives the MLX stack end to end (fake executor stands in for the real web_search / python tools that require browser + sandbox) Cell timeout 12-24 new tokens to keep CI fast; the goal is plumbing, not benchmark quality.
The previous run proved the 6 cartesian cells succeed on real macos-14 MLX with the new kwarg signature: baseline / thinking_on / thinking_off / web_search_only / python_only / both_tools+thinking all produced output, enable_thinking measurably flipped the model into "Thinking Process:" mode, tools schema measurably grew the prompt from 134 -> 1217+ chars. The agentic loop step then hung past the 25-min job timeout because mlx-vlm on a CPU-only macos-14 runner is too slow to complete even 24 new tokens reliably. Move the agentic-loop coverage back to the unit tests (43 cells in test_safetensors_tool_loop.py, all green) and keep the macos probe focused on the kwarg + chat_template_info contracts.
The shared tool_call_parser used by safetensors and MLX now recognises
five emission families so the agentic loop sees the same call shape
llama-server normalises for GGUF:
- Qwen / Hermes <tool_call>{json}</tool_call>
- Qwen3.5 / Hermes XML <function=name><parameter=k>v
- Llama-3 built-in tools <|python_tag|>NAME.call(k="v", ...)
- Llama-3 custom tools <|python_tag|>{"name":..., "parameters":...}
- Llama-3.2 bare JSON {"name":..., "parameters":...} (no tag)
- Mistral v0.3 / Nemo / Small [TOOL_CALLS] [{...}, ...]
- Mistral v11+ / Magistral [TOOL_CALLS]name{json} (may chain)
- Ministral / Large 3 [TOOL_CALLS]name[ARGS]{json}
- Gemma 4 <|tool_call>call:NAME{k:<|"|>v<|"|>}<tool_call|>
Parsers mirror the per-family regexes in llama.cpp's chat-parser.cpp
(legacy pre-PEG branch), vLLM's tool_parsers/, and SGLang's
function_call/ modules. Output is normalised to OpenAI shape:
{id, type:"function", function:{name, arguments(json_string)}}.
Truncated emissions (unclosed brackets, missing close tags) are
tolerated -- the parser walks balanced braces and falls back to per-
object healing.
Streaming buffer wakes up on five markers (was two) so the safetensors
/ MLX state machine drains tool calls instead of leaking them as prose:
TOOL_XML_SIGNALS now contains <tool_call>, <function=, <|python_tag|>,
[TOOL_CALLS], <|tool_call>.
CI: extends the staging cross-OS smoke with a multi-format parser
probe job that exercises every emission shape on ubuntu / macos-14 /
windows; runs the existing macos-14 MLX Qwen3.5-0.8B cartesian probe
unchanged.
Tests: 26 new unit tests covering each format (parser + streaming buffer
+ strip_tool_markup), plus 11 bare-JSON edge cases that guard against
false positives on plain prose / tool-message echoes.
Refs: llama.cpp commit 34df42f7be common/chat-parser.cpp;
vLLM main/tool_parsers/{llama,mistral,gemma}.py;
sglang main/function_call/{llama3,mistral,gemma}_format.py.
Mirrors the healing-parity follow-up commit from PR unslothai#5620: 1. GGUF BUFFERING tuple now uses the shared 5-format TOOL_XML_SIGNALS so non-Qwen tool emissions wake the state machine (was only 2 markers). 2. GGUF stream cleanup delegates to the shared strip_tool_markup so leaked markup from any family is removed from assistant content. 3. GGUF per-tool canonical heal key (python -> code, terminal -> command, * -> query) when arguments is a bare string. 4. Safetensors / MLX re-prompt on plan-without-action with _MAX_REPROMPTS=3 + extra iteration slots so re-prompts do not eat the tool budget. Also pulls in core/tool_healing.py which staging was missing (the legacy two-format helper module that llama_cpp.py imports the regex constants from).
Mirrors PR unslothai#5624: three more emission-family parsers for the shared tool_call_parser plus CI updates that exercise the new fixtures cross-OS. - DeepSeek R1 / V3 / V3.1: <|tool▁calls▁begin|>...<|tool▁sep|>... - GLM 4.5 / 4.6 / 4.7: <tool_call>NAME\n<arg_key>K</arg_key>\n <arg_value>V</arg_value>...</tool_call> - Kimi K2 / Moonshot: <|tool_calls_section_begin|>...<|tool_call_ argument_begin|>... Ported from llama.cpp common/chat-parser.cpp lines 801-913, 1040-1052 (MIT), vLLM tool_parsers/ {deepseekv31, glm4_moe, kimi_k2}_tool_parser.py (Apache-2.0), and SGLang function_call/ {deepseekv31, glm4_moe, kimik2}_detector.py (Apache-2.0). CI multi-format probe extended from 9 to 13 fixtures so all four new families run on ubuntu / macos-14 / windows.
Throwaway staging PR to verify the upstream PR unslothai#5520 (Studio safetensors agentic tool loop) tests pass on real ubuntu-latest, macos-14, and windows-latest runners using CPU-only torch + transformers. Burns no minutes on the upstream repo and stays well under the 5-concurrent-Windows-runner cap.
What this contains
Tests run per matrix cell
Triggers
Once this is green, the upstream PR is safe to merge.