Skip to content

Studio: inject synthetic respond tool into agentic loop#5706

Open
danielhanchen wants to merge 2 commits into
mainfrom
daniel/studio-respond-tool-injection
Open

Studio: inject synthetic respond tool into agentic loop#5706
danielhanchen wants to merge 2 commits into
mainfrom
daniel/studio-respond-tool-injection

Conversation

@danielhanchen

Copy link
Copy Markdown
Member

Problem

In the GGUF and safetensors agentic loops, small local models (~8B) routinely emit bare assistant text mid-loop when the prompt called for a tool. The current _MAX_REPROMPTS=3 plan-without-action nudge tries to push them back into tool mode, but on models like Ministral-3-8B and Granite-4-Micro the nudge often fails and the loop times out at the iteration cap. The root cause is that there is no structured way for the model to emit "just text" while tools are active.

Fix

Inject a synthetic respond(message: str) tool into the agentic loop's tools list at loop entry. When the model calls respond, the loop strips the call from the response and emits message as plain assistant content with finish_reason="stop". From the client's perspective the response is normal text; the synthetic tool is invisible.

Injection is defensive:

  • Empty or None tools list: skipped (no behavior change).
  • The caller already supplied a tool named respond: skipped, the client's semantics win.
  • The input tools list is not mutated. A fresh list is constructed (list(tools) + [RESPOND_TOOL]) to avoid TOCTOU on lists that a caller might reuse across requests.

Wired into both generate_chat_completion_with_tools in core/inference/llama_cpp.py (GGUF / llama-server path) and run_safetensors_tool_loop in core/inference/safetensors_agentic.py (transformers path).

The unwrap path is:

  1. Detect name == "respond" in the parsed tool call.
  2. Pull message out of arguments, which may arrive as either a JSON string or an already-decoded dict.
  3. Yield {"type": "status", "text": ""} so the SSE consumer's prev_text cursor resets (route handler at routes/inference.py:2531).
  4. Yield {"type": "content", "text": message} so the message streams as fresh cumulative.
  5. return, ending the agentic iteration. The route emits the closing finish_reason="stop" chunk.

Adapted from forge (https://github.com/antoinezambelli/forge, MIT). See ADR-013 in that repo for the rationale.

Test plan

Added 14 new tests in tests/test_safetensors_tool_loop.py:

TestRespondToolHelpers (9 cases):

  • Injection happens on a real tools list.
  • Empty list and None are no-ops.
  • Collision with a client-supplied respond tool is detected.
  • is_respond_call identifies the synthetic name.
  • extract_respond_message handles arguments arriving as JSON strings, dicts, malformed JSON, missing keys, and non-string values.

TestRespondToolUnwrap (4 cases):

  • A respond call emits the message as content and never calls execute_tool.
  • An empty message string still terminates cleanly.
  • A non-respond tool call goes through the normal execute path.
  • When the client supplies a real respond tool, the synthetic one is not injected and the client's tool runs normally.

Pass rate:

  • 14/14 new cases pass.
  • 57/57 in the full test_safetensors_tool_loop.py.
  • 1705 passed, 4 skipped in studio/backend/tests/. The pre-existing flash-attn failure on main (test_training_worker_flash_attn.py::test_flash_linear_attention_installs_pinned_pair_for_qwen3_5) is unrelated and reproduces with this branch stashed.

Smoke: unsloth studio boots cleanly via bash install.sh --local with UNSLOTH_STUDIO_HOME set; the editable install picks up the edits live and /api/health returns healthy.

Backwards compatibility

Behavior is unchanged for:

  • Requests with no tools (the agentic loop never engages anyway).
  • Requests whose tools list already includes one named respond (the synthetic tool is not injected).
  • Models that emit real tool calls. The unwrap path only fires when name == "respond".

The plan-without-action nudge (_MAX_REPROMPTS=3) is intentionally left in place. It still helps in the rare case where a model emits planning prose without calling any tool, including the new respond. A follow-up may retire that slack budget once respond-injection is shown to fully cover the failure mode in production.

Small local models (~8B) routinely emit bare assistant text mid-loop
when they should be calling a tool. Today that triggers the
_MAX_REPROMPTS plan-without-action nudge and often times out. This
patch gives the model a structured exit by injecting a synthetic
respond(message: str) tool into the agentic loop's tools list. When
the model calls respond, the loop strips the call from the response
and emits message as plain assistant content with finish_reason=stop.

Injection is defensive:

* Empty / None tools list: skipped.
* The caller already supplied a tool named "respond": skipped (the
  client's semantics win).
* The input tools list is not mutated; a fresh list is constructed.

Wired into both generate_chat_completion_with_tools (GGUF /
llama-server) and run_safetensors_tool_loop (transformers).

Adapted from forge (https://github.com/antoinezambelli/forge, MIT).

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a synthetic respond tool to provide a structured exit for assistant text in agentic loops within the llama_cpp and safetensors backends. The implementation includes logic to inject the tool and unwrap its calls to emit plain content. Review feedback identifies that the early return in llama_cpp.py when the respond tool is triggered skips yielding final metadata, such as token usage and timing metrics. Additionally, a suggestion was made to enhance the robustness of the inject_respond_tool function by adding type checks for tool definitions.

Comment on lines +5017 to +5022
if _respond_injected and is_respond_call(tc):
message = extract_respond_message(tc)
yield {"type": "status", "text": ""}
if message:
yield {"type": "content", "text": message}
return

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The early return here skips the final metadata yield (usage and timings) that normally occurs at the end of the generate_chat_completion_with_tools function (lines 5331-5340). This will cause the request to report zero or missing token counts and performance metrics in the UI when the respond tool is used.

Consider yielding the accumulated metadata before returning, or refactoring the loop to allow a clean exit that reaches the final metadata block to ensure all computed values are returned for callers to reuse.

References
  1. To improve efficiency, avoid redundant data iterations. Combine checks and transformations into a single loop and return computed values for callers to reuse.

Comment on lines +541 to +551
def inject_respond_tool(tools: list[dict] | None) -> tuple[list[dict] | None, bool]:
"""Defensively append ``RESPOND_TOOL`` to ``tools``. Skips when the
list is empty/None or a tool named ``respond`` already exists.
Returns ``(new_tools, was_injected)``. The input list is not mutated.
"""
if not tools:
return tools, False
for t in tools:
if (t.get("function") or {}).get("name") == RESPOND_TOOL_NAME:
return tools, False
return list(tools) + [RESPOND_TOOL], True

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The inject_respond_tool function assumes that tools is a list of dictionaries. While this is the expected type, adding a check to ensure each element t is actually a dictionary before calling .get() would make the function more robust against malformed input from external callers. This ensures that when merging tool definitions from multiple sources, we handle potential type mismatches gracefully without shadowing.

Suggested change
def inject_respond_tool(tools: list[dict] | None) -> tuple[list[dict] | None, bool]:
"""Defensively append ``RESPOND_TOOL`` to ``tools``. Skips when the
list is empty/None or a tool named ``respond`` already exists.
Returns ``(new_tools, was_injected)``. The input list is not mutated.
"""
if not tools:
return tools, False
for t in tools:
if (t.get("function") or {}).get("name") == RESPOND_TOOL_NAME:
return tools, False
return list(tools) + [RESPOND_TOOL], True
def inject_respond_tool(tools: list[dict] | None) -> tuple[list[dict] | None, bool]:
"""Defensively append RESPOND_TOOL to tools. Skips when the
list is empty/None or a tool named respond already exists.
Returns (new_tools, was_injected). The input list is not mutated.
"""
if not tools:
return tools, False
for t in tools:
if isinstance(t, dict) and (t.get("function") or {}).get("name") == RESPOND_TOOL_NAME:
return tools, False
return list(tools) + [RESPOND_TOOL], True
References
  1. When multiple sources for enabling tools are present (e.g., protocol-specific requests and payload extensions), merge the sets of tools instead of allowing one to shadow the other.
  2. To improve efficiency, avoid redundant data iterations. Combine checks and transformations into a single loop and return computed values for callers to reuse.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant