Skip to content

[staging] Safetensors tool loop CI smoke (ubuntu/macos/windows)#126

Open
danielhanchen wants to merge 14 commits into
mainfrom
safetensors-tool-loop-staging
Open

[staging] Safetensors tool loop CI smoke (ubuntu/macos/windows)#126
danielhanchen wants to merge 14 commits into
mainfrom
safetensors-tool-loop-staging

Conversation

@danielhanchen

Copy link
Copy Markdown
Owner

Throwaway staging PR to verify the upstream PR unslothai#5520 (Studio safetensors agentic tool loop) tests pass on real ubuntu-latest, macos-14, and windows-latest runners using CPU-only torch + transformers. Burns no minutes on the upstream repo and stays well under the 5-concurrent-Windows-runner cap.

What this contains

  • The 11 files PR-5520 changes (safetensors_agentic.py, tool_call_parser.py, chat_template_helpers.py, inference.py, llama_cpp.py, orchestrator.py, worker.py, routes/inference.py, tests/test_safetensors_tool_loop.py, utils/datasets/init.py, utils/datasets/model_mappings.py).
  • One new workflow .github/workflows/safetensors-tool-loop-ci.yml with a 3-cell OS matrix. Every other .github/workflows/ entry is removed so the fork's CI runs in minutes per push.

Tests run per matrix cell

  1. tests/test_safetensors_tool_loop.py -- 41 tests for parser shapes, state machine, allowlist guard, IPC kwarg forwarding, template fallback chain, prose containing literal `<tool_call>`, unique tool_call_id across iterations.
  2. Regression guard against test_openai_tool_passthrough, test_responses_tool_passthrough, test_inference_model_validation, test_anthropic_thinking_translation, test_anthropic_code_execution, test_anthropic_messages.

Triggers

  • push on this branch
  • pull_request limited via paths filter to studio/backend/core/inference/, studio/backend/routes/inference.py, studio/backend/tests/test_safetensors_tool_loop.py, studio/backend/utils/datasets/, and the workflow itself
  • workflow_dispatch
  • concurrency.cancel-in-progress so each new push supersedes the previous run

Once this is green, the upstream PR is safe to merge.

Bring the upstream PR-5520 changes into this staging fork and add one
focused CI workflow that runs the test_safetensors_tool_loop.py suite
plus the directly-adjacent tool / inference / anthropic regression
suites on ubuntu-latest, macos-14, and windows-latest under CPU-only
torch + transformers.

Scope:

- Copy the 11 files PR-5520 actually touches into the staging fork
  (core/inference/{tool_call_parser,safetensors_agentic,
  chat_template_helpers,inference,llama_cpp,orchestrator,worker}.py,
  routes/inference.py, tests/test_safetensors_tool_loop.py,
  utils/datasets/{__init__,model_mappings}.py).
- Drop every other .github/workflows/ file so the staging fork's CI
  budget stays well under the 5-concurrent-Windows-runner cap. Each
  push triggers exactly one workflow with three matrix cells.

Workflow shape:

- name: Safetensors tool loop CI
- matrix: ubuntu-latest, macos-14, windows-latest
- Python 3.11 across the board
- pip install CPU torch + transformers (matches Studio's existing
  backend-ci.yml dep shape) plus pytest, fastapi, pydantic, structlog,
  pyjwt, cryptography, python-multipart, etc.
- Steps:
  1. tests/test_safetensors_tool_loop.py (41 tests covering parser,
     state machine, allowlist, IPC kwarg forwarding, template fallback)
  2. Regression guard across openai_tool_passthrough,
     responses_tool_passthrough, inference_model_validation,
     anthropic_thinking_translation, anthropic_code_execution,
     anthropic_messages (~260 additional tests)
- paths filter limits trigger to the files this PR actually changes
  + the workflow itself, so unrelated commits do not re-run it.
- concurrency.cancel-in-progress: true so each new push supersedes
  the previous run.

CUDA spoof note: the safetensors loop never reaches model.generate;
the tests stub the single-turn cumulative generator. We install the
CPU torch wheel only because the studio/backend import chain
(utils.hardware) imports torch at module scope.
Comment thread studio/backend/routes/inference.py Fixed

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements agentic tool-calling for safetensors models, mirroring the GGUF backend's functionality. It introduces a shared XML tool-call parser, a backend-neutral chat template helper with fallback logic for reasoning and tool parameters, and improves GGUF's offline model resolution and logging. Feedback identifies a discrepancy in zero-iteration handling between backends and suggests using sorted JSON keys for more robust duplicate tool call detection.

Comment on lines +148 to +152
if max_tool_iterations <= 0:
# why safe: documented 0 = disabled; mirror the GGUF loop which
# produces no iterations under the same setting.
yield {"type": "status", "text": ""}
return

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This explicit return for max_tool_iterations <= 0 creates a discrepancy with the GGUF tool loop implementation. In the GGUF path (llama_cpp.py), setting iterations to 0 still performs one final generation turn (the 'final answer' pass). Here, it produces no generation at all. Removing this guard would allow the loop to run once (iteration 0), which correctly performs a single turn with the 'final answer' nudge if max_tool_iterations is 0, ensuring parity across backends.

"arguments": arguments,
}

tc_key = tool_name + str(arguments)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using str(arguments) for duplicate detection is sensitive to dictionary key ordering. While modern Python preserves insertion order, LLMs may emit the same arguments in a different order across turns. Using json.dumps with sort_keys=True provides a more robust canonical key for duplicate detection.

Suggested change
tc_key = tool_name + str(arguments)
tc_key = f"{tool_name}|{json.dumps(arguments, sort_keys=True)}"

windows-latest defaults `run:` blocks to PowerShell, which rejects the
backslash line continuations used in the pip install steps and treats
`\` as a literal directory argument. Setting `shell: bash` matrix-wide
routes through Git Bash (preinstalled on the runner) so one script
shape works across ubuntu, macos, and windows.
…e/search pills enable

Mirrors the upstream studio-safetensors-tools 327c42c commit on the
staging branch so the cross-OS smoke workflow exercises the fix too.

Before this, the Web Search / Code Execution / Think pills stayed
permanently disabled for every safetensors model because the
orchestrator/worker IPC bridge never marshalled the resolved
tokenizer.chat_template back from the subprocess, so the route layer's
capability detector saw chat_template=None and advertised
supports_tools=False.
…ng fork

Mirrors upstream 6c92b61 onto the cross-OS staging branch:

1. Robust __IMAGES__ sentinel stripping (leading and consecutive
   sentinels) in safetensors_agentic.py.
2. Debug-log the gpt-oss override probe failure instead of swallowing.
3. Tighten the safetensors tool-stream and JSON tool-completion
   exception paths so a constant message goes over the wire and the
   detail stays in logger.exception (CWE-209 / CodeQL alerts 95/96).
4. Two new tests pinning the leading-sentinel and consecutive-
   sentinel edge cases.
…o fix

Adds a dedicated macos-14 job to the staging workflow that:
  - installs mlx / mlx-lm / mlx-vlm + Studio backend deps
  - loads unsloth/Qwen3.5-0.8B for real via MLXInferenceBackend
  - asserts the chat_template_info IPC payload now contains the
    template
  - asserts _detect_safetensors_features returns supports_tools=True
    and supports_reasoning=True

Mirrors the upstream MLXInferenceBackend._populate_chat_template_info
fix (commit b1b1623) so the staging branch exercises identical code.
Mirrors upstream dafad14 onto the staging branch (MLX kwarg fix) and
extends the macos-14 MLX job so it now exercises:

  1. backend.load_model("unsloth/Qwen3.5-0.8B") via MLXInferenceBackend
  2. chat_template_info populated on backend.models[name] (IPC contract)
  3. _detect_safetensors_features returns supports_tools=True +
     supports_reasoning=True (LoadResponse contract)
  4. backend.generate_chat_response actually accepts the four template
     kwargs and yields cumulative text across 6 cartesian cells:
       baseline / thinking_on / thinking_off / web_search_only /
       python_only / both_tools+thinking
  5. safetensors_agentic.run_safetensors_tool_loop drives the MLX
     stack end to end (fake executor stands in for the real
     web_search / python tools that require browser + sandbox)

Cell timeout 12-24 new tokens to keep CI fast; the goal is plumbing,
not benchmark quality.
The previous run proved the 6 cartesian cells succeed on real macos-14
MLX with the new kwarg signature:
  baseline / thinking_on / thinking_off / web_search_only /
  python_only / both_tools+thinking
all produced output, enable_thinking measurably flipped the model into
"Thinking Process:" mode, tools schema measurably grew the prompt
from 134 -> 1217+ chars.

The agentic loop step then hung past the 25-min job timeout because
mlx-vlm on a CPU-only macos-14 runner is too slow to complete even
24 new tokens reliably. Move the agentic-loop coverage back to the
unit tests (43 cells in test_safetensors_tool_loop.py, all green)
and keep the macos probe focused on the kwarg + chat_template_info
contracts.
The shared tool_call_parser used by safetensors and MLX now recognises
five emission families so the agentic loop sees the same call shape
llama-server normalises for GGUF:

  - Qwen / Hermes  <tool_call>{json}</tool_call>
  - Qwen3.5 / Hermes XML  <function=name><parameter=k>v
  - Llama-3 built-in tools  <|python_tag|>NAME.call(k="v", ...)
  - Llama-3 custom tools    <|python_tag|>{"name":..., "parameters":...}
  - Llama-3.2 bare JSON     {"name":..., "parameters":...}  (no tag)
  - Mistral v0.3 / Nemo / Small  [TOOL_CALLS] [{...}, ...]
  - Mistral v11+ / Magistral     [TOOL_CALLS]name{json}  (may chain)
  - Ministral / Large 3          [TOOL_CALLS]name[ARGS]{json}
  - Gemma 4                      <|tool_call>call:NAME{k:<|"|>v<|"|>}<tool_call|>

Parsers mirror the per-family regexes in llama.cpp's chat-parser.cpp
(legacy pre-PEG branch), vLLM's tool_parsers/, and SGLang's
function_call/ modules. Output is normalised to OpenAI shape:
{id, type:"function", function:{name, arguments(json_string)}}.

Truncated emissions (unclosed brackets, missing close tags) are
tolerated -- the parser walks balanced braces and falls back to per-
object healing.

Streaming buffer wakes up on five markers (was two) so the safetensors
/ MLX state machine drains tool calls instead of leaking them as prose:
TOOL_XML_SIGNALS now contains <tool_call>, <function=, <|python_tag|>,
[TOOL_CALLS], <|tool_call>.

CI: extends the staging cross-OS smoke with a multi-format parser
probe job that exercises every emission shape on ubuntu / macos-14 /
windows; runs the existing macos-14 MLX Qwen3.5-0.8B cartesian probe
unchanged.

Tests: 26 new unit tests covering each format (parser + streaming buffer
+ strip_tool_markup), plus 11 bare-JSON edge cases that guard against
false positives on plain prose / tool-message echoes.

Refs: llama.cpp commit 34df42f7be common/chat-parser.cpp;
vLLM main/tool_parsers/{llama,mistral,gemma}.py;
sglang main/function_call/{llama3,mistral,gemma}_format.py.
Mirrors the healing-parity follow-up commit from PR unslothai#5620:

1. GGUF BUFFERING tuple now uses the shared 5-format
   TOOL_XML_SIGNALS so non-Qwen tool emissions wake the state
   machine (was only 2 markers).
2. GGUF stream cleanup delegates to the shared strip_tool_markup
   so leaked markup from any family is removed from assistant
   content.
3. GGUF per-tool canonical heal key (python -> code, terminal
   -> command, * -> query) when arguments is a bare string.
4. Safetensors / MLX re-prompt on plan-without-action with
   _MAX_REPROMPTS=3 + extra iteration slots so re-prompts do
   not eat the tool budget.

Also pulls in core/tool_healing.py which staging was missing
(the legacy two-format helper module that llama_cpp.py imports
the regex constants from).
Mirrors PR unslothai#5624: three more emission-family parsers for the shared
tool_call_parser plus CI updates that exercise the new fixtures
cross-OS.

- DeepSeek R1 / V3 / V3.1: <|tool▁calls▁begin|>...<|tool▁sep|>...
- GLM 4.5 / 4.6 / 4.7: <tool_call>NAME\n<arg_key>K</arg_key>\n
  <arg_value>V</arg_value>...</tool_call>
- Kimi K2 / Moonshot: <|tool_calls_section_begin|>...<|tool_call_
  argument_begin|>...

Ported from llama.cpp common/chat-parser.cpp lines 801-913,
1040-1052 (MIT), vLLM tool_parsers/ {deepseekv31, glm4_moe,
kimi_k2}_tool_parser.py (Apache-2.0), and SGLang function_call/
{deepseekv31, glm4_moe, kimik2}_detector.py (Apache-2.0).

CI multi-format probe extended from 9 to 13 fixtures so all four new
families run on ubuntu / macos-14 / windows.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants