Studio: inject synthetic respond tool into agentic loop#5706
Studio: inject synthetic respond tool into agentic loop#5706danielhanchen wants to merge 2 commits into
Conversation
Small local models (~8B) routinely emit bare assistant text mid-loop when they should be calling a tool. Today that triggers the _MAX_REPROMPTS plan-without-action nudge and often times out. This patch gives the model a structured exit by injecting a synthetic respond(message: str) tool into the agentic loop's tools list. When the model calls respond, the loop strips the call from the response and emits message as plain assistant content with finish_reason=stop. Injection is defensive: * Empty / None tools list: skipped. * The caller already supplied a tool named "respond": skipped (the client's semantics win). * The input tools list is not mutated; a fresh list is constructed. Wired into both generate_chat_completion_with_tools (GGUF / llama-server) and run_safetensors_tool_loop (transformers). Adapted from forge (https://github.com/antoinezambelli/forge, MIT).
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
Code Review
This pull request introduces a synthetic respond tool to provide a structured exit for assistant text in agentic loops within the llama_cpp and safetensors backends. The implementation includes logic to inject the tool and unwrap its calls to emit plain content. Review feedback identifies that the early return in llama_cpp.py when the respond tool is triggered skips yielding final metadata, such as token usage and timing metrics. Additionally, a suggestion was made to enhance the robustness of the inject_respond_tool function by adding type checks for tool definitions.
| if _respond_injected and is_respond_call(tc): | ||
| message = extract_respond_message(tc) | ||
| yield {"type": "status", "text": ""} | ||
| if message: | ||
| yield {"type": "content", "text": message} | ||
| return |
There was a problem hiding this comment.
The early return here skips the final metadata yield (usage and timings) that normally occurs at the end of the generate_chat_completion_with_tools function (lines 5331-5340). This will cause the request to report zero or missing token counts and performance metrics in the UI when the respond tool is used.
Consider yielding the accumulated metadata before returning, or refactoring the loop to allow a clean exit that reaches the final metadata block to ensure all computed values are returned for callers to reuse.
References
- To improve efficiency, avoid redundant data iterations. Combine checks and transformations into a single loop and return computed values for callers to reuse.
| def inject_respond_tool(tools: list[dict] | None) -> tuple[list[dict] | None, bool]: | ||
| """Defensively append ``RESPOND_TOOL`` to ``tools``. Skips when the | ||
| list is empty/None or a tool named ``respond`` already exists. | ||
| Returns ``(new_tools, was_injected)``. The input list is not mutated. | ||
| """ | ||
| if not tools: | ||
| return tools, False | ||
| for t in tools: | ||
| if (t.get("function") or {}).get("name") == RESPOND_TOOL_NAME: | ||
| return tools, False | ||
| return list(tools) + [RESPOND_TOOL], True |
There was a problem hiding this comment.
The inject_respond_tool function assumes that tools is a list of dictionaries. While this is the expected type, adding a check to ensure each element t is actually a dictionary before calling .get() would make the function more robust against malformed input from external callers. This ensures that when merging tool definitions from multiple sources, we handle potential type mismatches gracefully without shadowing.
| def inject_respond_tool(tools: list[dict] | None) -> tuple[list[dict] | None, bool]: | |
| """Defensively append ``RESPOND_TOOL`` to ``tools``. Skips when the | |
| list is empty/None or a tool named ``respond`` already exists. | |
| Returns ``(new_tools, was_injected)``. The input list is not mutated. | |
| """ | |
| if not tools: | |
| return tools, False | |
| for t in tools: | |
| if (t.get("function") or {}).get("name") == RESPOND_TOOL_NAME: | |
| return tools, False | |
| return list(tools) + [RESPOND_TOOL], True | |
| def inject_respond_tool(tools: list[dict] | None) -> tuple[list[dict] | None, bool]: | |
| """Defensively append RESPOND_TOOL to tools. Skips when the | |
| list is empty/None or a tool named respond already exists. | |
| Returns (new_tools, was_injected). The input list is not mutated. | |
| """ | |
| if not tools: | |
| return tools, False | |
| for t in tools: | |
| if isinstance(t, dict) and (t.get("function") or {}).get("name") == RESPOND_TOOL_NAME: | |
| return tools, False | |
| return list(tools) + [RESPOND_TOOL], True |
References
- When multiple sources for enabling tools are present (e.g., protocol-specific requests and payload extensions), merge the sets of tools instead of allowing one to shadow the other.
- To improve efficiency, avoid redundant data iterations. Combine checks and transformations into a single loop and return computed values for callers to reuse.
Problem
In the GGUF and safetensors agentic loops, small local models (~8B) routinely emit bare assistant text mid-loop when the prompt called for a tool. The current
_MAX_REPROMPTS=3plan-without-action nudge tries to push them back into tool mode, but on models like Ministral-3-8B and Granite-4-Micro the nudge often fails and the loop times out at the iteration cap. The root cause is that there is no structured way for the model to emit "just text" while tools are active.Fix
Inject a synthetic
respond(message: str)tool into the agentic loop'stoolslist at loop entry. When the model callsrespond, the loop strips the call from the response and emitsmessageas plain assistant content withfinish_reason="stop". From the client's perspective the response is normal text; the synthetic tool is invisible.Injection is defensive:
Nonetoolslist: skipped (no behavior change).respond: skipped, the client's semantics win.toolslist is not mutated. A fresh list is constructed (list(tools) + [RESPOND_TOOL]) to avoid TOCTOU on lists that a caller might reuse across requests.Wired into both
generate_chat_completion_with_toolsincore/inference/llama_cpp.py(GGUF / llama-server path) andrun_safetensors_tool_loopincore/inference/safetensors_agentic.py(transformers path).The unwrap path is:
name == "respond"in the parsed tool call.messageout ofarguments, which may arrive as either a JSON string or an already-decoded dict.{"type": "status", "text": ""}so the SSE consumer'sprev_textcursor resets (route handler atroutes/inference.py:2531).{"type": "content", "text": message}so the message streams as fresh cumulative.return, ending the agentic iteration. The route emits the closingfinish_reason="stop"chunk.Adapted from forge (https://github.com/antoinezambelli/forge, MIT). See ADR-013 in that repo for the rationale.
Test plan
Added 14 new tests in
tests/test_safetensors_tool_loop.py:TestRespondToolHelpers(9 cases):Noneare no-ops.respondtool is detected.is_respond_callidentifies the synthetic name.extract_respond_messagehandles arguments arriving as JSON strings, dicts, malformed JSON, missing keys, and non-string values.TestRespondToolUnwrap(4 cases):respondcall emits the message as content and never callsexecute_tool.messagestring still terminates cleanly.respondtool call goes through the normal execute path.respondtool, the synthetic one is not injected and the client's tool runs normally.Pass rate:
test_safetensors_tool_loop.py.studio/backend/tests/. The pre-existing flash-attn failure onmain(test_training_worker_flash_attn.py::test_flash_linear_attention_installs_pinned_pair_for_qwen3_5) is unrelated and reproduces with this branch stashed.Smoke:
unsloth studioboots cleanly viabash install.sh --localwithUNSLOTH_STUDIO_HOMEset; the editable install picks up the edits live and/api/healthreturns healthy.Backwards compatibility
Behavior is unchanged for:
toolslist already includes one namedrespond(the synthetic tool is not injected).name == "respond".The plan-without-action nudge (
_MAX_REPROMPTS=3) is intentionally left in place. It still helps in the rare case where a model emits planning prose without calling any tool, including the newrespond. A follow-up may retire that slack budget once respond-injection is shown to fully cover the failure mode in production.