Studio: inject synthetic respond tool into agentic loop by danielhanchen · Pull Request #5706 · unslothai/unsloth

danielhanchen · 2026-05-22T14:29:42Z

Problem

In the GGUF and safetensors agentic loops, small local models (~8B) routinely emit bare assistant text mid-loop when the prompt called for a tool. The current _MAX_REPROMPTS=3 plan-without-action nudge tries to push them back into tool mode, but on models like Ministral-3-8B and Granite-4-Micro the nudge often fails and the loop times out at the iteration cap. The root cause is that there is no structured way for the model to emit "just text" while tools are active.

Fix

Inject a synthetic respond(message: str) tool into the agentic loop's tools list at loop entry. When the model calls respond, the loop strips the call from the response and emits message as plain assistant content with finish_reason="stop". From the client's perspective the response is normal text; the synthetic tool is invisible.

Injection is defensive:

Empty or None tools list: skipped (no behavior change).
The caller already supplied a tool named respond: skipped, the client's semantics win.
The input tools list is not mutated. A fresh list is constructed (list(tools) + [RESPOND_TOOL]) to avoid TOCTOU on lists that a caller might reuse across requests.

Wired into both generate_chat_completion_with_tools in core/inference/llama_cpp.py (GGUF / llama-server path) and run_safetensors_tool_loop in core/inference/safetensors_agentic.py (transformers path).

The unwrap path is:

Detect name == "respond" in the parsed tool call.
Pull message out of arguments, which may arrive as either a JSON string or an already-decoded dict.
Yield {"type": "status", "text": ""} so the SSE consumer's prev_text cursor resets (route handler at routes/inference.py:2531).
Yield {"type": "content", "text": message} so the message streams as fresh cumulative.
return, ending the agentic iteration. The route emits the closing finish_reason="stop" chunk.

Adapted from forge (https://github.com/antoinezambelli/forge, MIT). See ADR-013 in that repo for the rationale.

Test plan

Added 14 new tests in tests/test_safetensors_tool_loop.py:

TestRespondToolHelpers (9 cases):

Injection happens on a real tools list.
Empty list and None are no-ops.
Collision with a client-supplied respond tool is detected.
is_respond_call identifies the synthetic name.
extract_respond_message handles arguments arriving as JSON strings, dicts, malformed JSON, missing keys, and non-string values.

TestRespondToolUnwrap (4 cases):

A respond call emits the message as content and never calls execute_tool.
An empty message string still terminates cleanly.
A non-respond tool call goes through the normal execute path.
When the client supplies a real respond tool, the synthetic one is not injected and the client's tool runs normally.

Pass rate:

14/14 new cases pass.
57/57 in the full test_safetensors_tool_loop.py.
1705 passed, 4 skipped in studio/backend/tests/. The pre-existing flash-attn failure on main (test_training_worker_flash_attn.py::test_flash_linear_attention_installs_pinned_pair_for_qwen3_5) is unrelated and reproduces with this branch stashed.

Smoke: unsloth studio boots cleanly via bash install.sh --local with UNSLOTH_STUDIO_HOME set; the editable install picks up the edits live and /api/health returns healthy.

Backwards compatibility

Behavior is unchanged for:

Requests with no tools (the agentic loop never engages anyway).
Requests whose tools list already includes one named respond (the synthetic tool is not injected).
Models that emit real tool calls. The unwrap path only fires when name == "respond".

The plan-without-action nudge (_MAX_REPROMPTS=3) is intentionally left in place. It still helps in the rare case where a model emits planning prose without calling any tool, including the new respond. A follow-up may retire that slack budget once respond-injection is shown to fully cover the failure mode in production.

Small local models (~8B) routinely emit bare assistant text mid-loop when they should be calling a tool. Today that triggers the _MAX_REPROMPTS plan-without-action nudge and often times out. This patch gives the model a structured exit by injecting a synthetic respond(message: str) tool into the agentic loop's tools list. When the model calls respond, the loop strips the call from the response and emits message as plain assistant content with finish_reason=stop. Injection is defensive: * Empty / None tools list: skipped. * The caller already supplied a tool named "respond": skipped (the client's semantics win). * The input tools list is not mutated; a fresh list is constructed. Wired into both generate_chat_completion_with_tools (GGUF / llama-server) and run_safetensors_tool_loop (transformers). Adapted from forge (https://github.com/antoinezambelli/forge, MIT).

for more information, see https://pre-commit.ci

gemini-code-assist

Code Review

This pull request introduces a synthetic respond tool to provide a structured exit for assistant text in agentic loops within the llama_cpp and safetensors backends. The implementation includes logic to inject the tool and unwrap its calls to emit plain content. Review feedback identifies that the early return in llama_cpp.py when the respond tool is triggered skips yielding final metadata, such as token usage and timing metrics. Additionally, a suggestion was made to enhance the robustness of the inject_respond_tool function by adding type checks for tool definitions.

gemini-code-assist · 2026-05-22T14:33:22Z

+                    if _respond_injected and is_respond_call(tc):
+                        message = extract_respond_message(tc)
+                        yield {"type": "status", "text": ""}
+                        if message:
+                            yield {"type": "content", "text": message}
+                        return


The early return here skips the final metadata yield (usage and timings) that normally occurs at the end of the generate_chat_completion_with_tools function (lines 5331-5340). This will cause the request to report zero or missing token counts and performance metrics in the UI when the respond tool is used.

Consider yielding the accumulated metadata before returning, or refactoring the loop to allow a clean exit that reaches the final metadata block to ensure all computed values are returned for callers to reuse.

References

To improve efficiency, avoid redundant data iterations. Combine checks and transformations into a single loop and return computed values for callers to reuse.

gemini-code-assist · 2026-05-22T14:33:22Z

+def inject_respond_tool(tools: list[dict] | None) -> tuple[list[dict] | None, bool]:
+    """Defensively append ``RESPOND_TOOL`` to ``tools``. Skips when the
+    list is empty/None or a tool named ``respond`` already exists.
+    Returns ``(new_tools, was_injected)``. The input list is not mutated.
+    """
+    if not tools:
+        return tools, False
+    for t in tools:
+        if (t.get("function") or {}).get("name") == RESPOND_TOOL_NAME:
+            return tools, False
+    return list(tools) + [RESPOND_TOOL], True


The inject_respond_tool function assumes that tools is a list of dictionaries. While this is the expected type, adding a check to ensure each element t is actually a dictionary before calling .get() would make the function more robust against malformed input from external callers. This ensures that when merging tool definitions from multiple sources, we handle potential type mismatches gracefully without shadowing.

Suggested change

def inject_respond_tool(tools: list[dict] | None) -> tuple[list[dict] | None, bool]:

"""Defensively append ``RESPOND_TOOL`` to ``tools``. Skips when the

list is empty/None or a tool named ``respond`` already exists.

Returns ``(new_tools, was_injected)``. The input list is not mutated.

"""

if not tools:

return tools, False

for t in tools:

if (t.get("function") or {}).get("name") == RESPOND_TOOL_NAME:

return tools, False

return list(tools) + [RESPOND_TOOL], True

def inject_respond_tool(tools: list[dict] | None) -> tuple[list[dict] | None, bool]:

"""Defensively append RESPOND_TOOL to tools. Skips when the

list is empty/None or a tool named respond already exists.

Returns (new_tools, was_injected). The input list is not mutated.

"""

if not tools:

return tools, False

for t in tools:

if isinstance(t, dict) and (t.get("function") or {}).get("name") == RESPOND_TOOL_NAME:

return tools, False

return list(tools) + [RESPOND_TOOL], True

References

When multiple sources for enabling tools are present (e.g., protocol-specific requests and payload extensions), merge the sets of tools instead of allowing one to shadow the other.

To improve efficiency, avoid redundant data iterations. Combine checks and transformations into a single loop and return computed values for callers to reuse.

danielhanchen requested a review from rolandtannous as a code owner May 22, 2026 14:29

[pre-commit.ci] auto fixes from pre-commit.com hooks

b1b2d01

for more information, see https://pre-commit.ci

gemini-code-assist Bot reviewed May 22, 2026

View reviewed changes

oobabooga mentioned this pull request May 26, 2026

Studio: stop the model from replying twice when it refuses #5775

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Studio: inject synthetic respond tool into agentic loop#5706

Studio: inject synthetic respond tool into agentic loop#5706
danielhanchen wants to merge 2 commits into
mainfrom
daniel/studio-respond-tool-injection

danielhanchen commented May 22, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot May 22, 2026

Uh oh!

gemini-code-assist Bot May 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

danielhanchen commented May 22, 2026

Problem

Fix

Test plan

Backwards compatibility

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 22, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 22, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant