Fix tool call parsing, add tool outputs panel and UI improvements#4416
Conversation
Show a count-up seconds timer (0s, 1s, 2s, ...) next to the tool status text in the composer area. Helps users gauge how long a tool call (web search, code execution) has been running. Timer resets when a new tool starts and disappears when all tools finish.
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly improves the user experience in Studio by providing immediate visual feedback on the duration of tool executions. By displaying an elapsed timer next to the tool status, users can now quickly gauge how long operations like web searches or code executions have been running, enhancing transparency and reducing perceived latency during interactive sessions. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces an elapsed timer to the tool status display, enhancing user feedback during tool operations. The implementation is straightforward, utilizing a useEffect hook. I've provided one suggestion to refactor the useEffect logic for improved clarity and to remove redundancy. The rest of the changes are solid.
| if (!toolStatus) { | ||
| setElapsed(0); | ||
| return; | ||
| } | ||
| setElapsed(0); | ||
| const interval = setInterval(() => { | ||
| setElapsed((prev) => prev + 1); | ||
| }, 1000); | ||
| return () => clearInterval(interval); |
There was a problem hiding this comment.
The logic inside this useEffect can be simplified to be more declarative. By first resetting the elapsed time, and then conditionally starting the interval only when toolStatus is truthy, you can avoid redundant state updates and make the code's intent clearer.
setElapsed(0);
if (toolStatus) {
const interval = setInterval(() => {
setElapsed((prev) => prev + 1);
}, 1000);
return () => clearInterval(interval);
}
Backend: - Rewrite tool call XML parser to use balanced-brace JSON extraction instead of greedy regex, fixing truncation on nested braces in code/JSON arguments - Handle optional closing tags (</tool_call>, </function>, </parameter>) that models frequently omit - Support bare <function=...> tags without <tool_call> wrapper - Strip tool call markup from streamed content so raw XML never leaks into the chat UI - Use a persistent ~/studio_sandbox/ working directory for tool execution so files persist across calls within a session - Emit tool_start/tool_end SSE events so the frontend can display tool inputs and outputs Frontend: - Add collapsible "Tool Outputs" panel below assistant messages showing each tool call's input and output with copy buttons - Add copy button to reasoning blocks - Add elapsed timer to tool status pill - Update project URLs in pyproject.toml (http -> https, add docs link)
|
/gemini review |
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d0241036cf
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if not tool_calls and ( | ||
| "<tool_call>" in content_text or "<function=" in content_text | ||
| ): |
There was a problem hiding this comment.
Restrict fallback parsing to explicit tool-call markup
The fallback now treats any assistant content containing "<function=" as a tool call candidate, even when finish_reason is "stop". In practice, a normal answer that shows XML/code examples like <function=web_search> will be parsed and executed as a real tool invocation, which can trigger unintended web/python/terminal calls and strip user-visible text. This path should be gated to explicit tool-call contexts (e.g., <tool_call> wrapper or structured tool_calls) rather than arbitrary content.
Useful? React with 👍 / 👎.
| next_func = func_starts[idx + 1].start() if idx + 1 < len(func_starts) else len(content) | ||
| end_tag = re.search(r"</tool_call>", content[body_start:]) |
There was a problem hiding this comment.
Parse function bodies without using raw
<function= sentinels
Using the next raw <function= token as a hard body boundary breaks valid single-call payloads when that literal appears inside parameter text (for example Python code containing "<function=...>"). The current logic truncates the first tool's arguments and can fabricate an extra tool call from the in-band string, leading to corrupted execution and wrong tool invocations. Function boundaries need to be derived from actual markup structure, not unconstrained substring matches.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Code Review
This pull request significantly enhances tool call handling and user interface for tool outputs. It updates website URLs in pyproject.toml. The backend llama_cpp.py now robustly parses tool calls from text, including complex JSON and XML structures with optional closing tags, and introduces tool_start/tool_end events for better tracking, alongside improved stripping of tool call markup from content. The tools.py file now uses a persistent working directory for Python and Bash executions, allowing files to persist across tool calls. The inference.py route is updated to pass these new tool events to the frontend. On the frontend, new ToolOutputsGroup and ToolItemCard components are introduced to display detailed, collapsible tool inputs and outputs, complete with copy functionality. The reasoning.tsx component gains a copy button for reasoning text, and thread.tsx now shows an elapsed time for tool execution. Review comments suggest improving the workdir creation in tools.py to prevent race conditions, optimizing re.sub usage in llama_cpp.py for efficiency, and using requestAnimationFrame for smoother elapsed time updates in thread.tsx.
| global _persistent_workdir | ||
| if _persistent_workdir is None or not os.path.isdir(_persistent_workdir): | ||
| home = os.path.expanduser("~") | ||
| workdir = os.path.join(home, "studio_sandbox") | ||
| os.makedirs(workdir, exist_ok = True) |
There was a problem hiding this comment.
Consider adding a check to ensure that the workdir is only created if it doesn't already exist. This can prevent potential race conditions if multiple threads or processes try to create the directory simultaneously.
if _persistent_workdir is None or not os.path.isdir(_persistent_workdir):
home = os.path.expanduser("~")
workdir = os.path.join(home, "studio_sandbox")
if not os.path.exists(workdir): # Add this check
os.makedirs(workdir, exist_ok=True)
_persistent_workdir = workdir| flags = re.DOTALL, | ||
| ) | ||
| # Strip unterminated <tool_call>... to end | ||
| content_text = re.sub( |
There was a problem hiding this comment.
Using re.sub with re.DOTALL to remove potentially large blocks of text could be inefficient. Consider a more targeted approach if performance becomes an issue, perhaps by finding the start and end indices and slicing the string.
content_text = re.sub(
r"<tool_call>.*$",
"",
content_text,
flags = re.DOTALL,
).strip()| setElapsed(0); | ||
| const interval = setInterval(() => { | ||
| setElapsed((prev) => prev + 1); | ||
| }, 1000); | ||
| return () => clearInterval(interval); |
There was a problem hiding this comment.
Consider using requestAnimationFrame instead of setInterval for updating the elapsed time. This can provide smoother updates and better performance, as it synchronizes with the browser's rendering pipeline.
useEffect(() => {
if (!toolStatus) {
setElapsed(0);
return;
}
setElapsed(0);
const updateElapsed = () => {
setElapsed((prev) => prev + 1);
};
let animationFrameId: number;
const tick = () => {
updateElapsed();
animationFrameId = requestAnimationFrame(tick);
};
animationFrameId = requestAnimationFrame(tick);
return () => cancelAnimationFrame(animationFrameId);
}, [toolStatus]);
HTML code fences now render an interactive sandboxed iframe preview below the syntax-highlighted code, similar to how SVG fences show an image preview. The iframe uses sandbox="allow-scripts" to allow JavaScript execution while blocking access to the parent page. Includes a fullscreen toggle (enlarge/minimize button) that expands the preview into a viewport overlay, dismissible via button, Escape key, or backdrop click. A streaming placeholder prevents partial HTML from rendering mid-stream.
|
/gemini review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9975db655b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| _re_final.compile(r"<tool_call>.*?</tool_call>", _re_final.DOTALL), | ||
| _re_final.compile(r"<function=\w+>.*?</function>", _re_final.DOTALL), | ||
| _re_final.compile(r"<tool_call>.*$", _re_final.DOTALL), | ||
| _re_final.compile(r"<function=\w+>.*$", _re_final.DOTALL), |
There was a problem hiding this comment.
Limit tool-markup stripping to confirmed tool call blocks
The final streaming cleanup applies r"<function=\w+>.*$" (and the analogous <tool_call> pattern) to every assistant response after tools run, so any legitimate text that contains these literals is truncated from that point onward. In practice, answers that explain XML/examples like <function=foo> lose the remainder of the message. This should only strip markup when the segment is an actual tool-call payload, not any raw substring match.
Useful? React with 👍 / 👎.
| val = body[pm.end() :] | ||
| val = re.sub(r"\s*</parameter>\s*$", "", val) | ||
| arguments[pm.group(1)] = val.strip() |
There was a problem hiding this comment.
Stop bare-function parameter capture at closing function tag
In the bare <function=...> fallback path, single-parameter parsing takes everything from <parameter=...> to the end of body, and </function> is only removed if it is the final suffix. If the model emits a valid closing </function> followed by normal text, that closing tag and trailing prose are included in the tool argument, which produces malformed query/code/command values and incorrect tool execution.
Useful? React with 👍 / 👎.
| workdir = os.path.join(home, "studio_sandbox") | ||
| os.makedirs(workdir, exist_ok = True) |
There was a problem hiding this comment.
Isolate tool working directories per conversation
The new _get_workdir() hard-codes a single process-wide ~/studio_sandbox directory, and both Python and Bash tools now execute in that shared cwd. In multi-chat or multi-user deployments this leaks files across requests and creates cross-run filename collisions, so one conversation can read or overwrite another's artifacts. The workdir should be scoped (or cleaned) per session/conversation rather than shared for the server lifetime.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Code Review
This pull request significantly enhances tool execution and display capabilities. Backend changes include more robust parsing of tool calls (JSON and XML-style) from model outputs in llama_cpp.py, the introduction of tool_start and tool_end events for detailed tool execution tracking, and the implementation of a persistent working directory (~/studio_sandbox/) for Python and Bash tool executions in tools.py to maintain state across calls. Frontend improvements involve a new ToolOutputsGroup component to display detailed tool inputs and outputs in a collapsible UI, an elapsed time display for tool execution, and an HtmlPreview component for rendering HTML code fences within an iframe, complete with resizing and fullscreen options. Additionally, a copy button for reasoning text was added. A security concern was raised regarding the HtmlPreview component, which renders LLM-generated HTML in an iframe with sandbox="allow-scripts", potentially introducing cross-site scripting vulnerabilities; a more restrictive sandbox or HTML sanitization is recommended.
| <iframe | ||
| ref={iframeRef} | ||
| srcDoc={srcDoc} | ||
| sandbox="allow-scripts" |
There was a problem hiding this comment.
The HtmlPreview component renders HTML generated by the language model inside an iframe with sandbox="allow-scripts". While the sandbox provides some protection, allowing scripts from an untrusted source (the LLM) to run can introduce security vulnerabilities. For example, a malicious script could attempt to make requests to localhost or other internal services, or perform other undesirable actions.
For enhanced security, consider using a more restrictive sandbox or sanitizing the HTML before rendering. If the ResizeObserver script is the only reason for allow-scripts, you could explore alternative ways to size the iframe that don't require executing arbitrary scripts from the LLM output. One option is to use a library like dompurify to sanitize the HTML, allowing only safe tags and attributes.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5bef5953f4
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| <iframe | ||
| ref={iframeRef} | ||
| srcDoc={srcDoc} | ||
| sandbox="allow-scripts" | ||
| style={{ width: "100%", height, border: "none", display: "block" }} |
There was a problem hiding this comment.
Prevent automatic script execution in HTML previews
HtmlPreview feeds untrusted assistant-generated HTML directly into srcDoc and enables sandbox="allow-scripts", so any <script> inside an html code fence runs automatically as soon as the message renders. Since model output can include prompt-injected or adversarial HTML, this allows arbitrary browser-side code execution (e.g., CPU-burning loops or unsolicited network beacons) without an explicit user opt-in. This should be gated behind a user action or rendered without script permissions.
Useful? React with 👍 / 👎.
| > | ||
| <div className="mt-2 space-y-2"> | ||
| {toolOutputs.map((tool, idx) => ( | ||
| <ToolItemCard key={tool.toolCallId || idx} tool={tool} /> |
There was a problem hiding this comment.
Use unique keys for tool output list items
The list key uses tool.toolCallId directly, but fallback tool-call parsing emits IDs like call_0 per parse cycle, so multi-step tool iterations can produce duplicate IDs within one message. Duplicate React keys can cause row state/content to be reused incorrectly (for example, collapsible state sticking to the wrong tool output) and may drop or merge entries unpredictably. Use a guaranteed-unique key per rendered entry (e.g., include idx or another monotonic suffix).
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 24c1e4a56c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| for pat in _TOOL_CALL_PATTERNS: | ||
| text = pat.sub("", text) | ||
| return text.strip() |
There was a problem hiding this comment.
Keep streamed text monotonic after markup filtering
The streaming pipeline assumes each emitted content value is a monotonic prefix (new_text = cumulative[len(prev_text):] in studio/backend/routes/inference.py), but _strip_tool_markup can delete already-emitted characters once a partial tag becomes a full regex match and then trims again with .strip(). When a model streams tool markup across multiple chunks, this causes unrecoverable corruption (e.g., leaked partial tag prefixes or words glued together) because the proxy cannot retract prior bytes. Make the filtered cumulative text monotonic (or defer filtering until full blocks are known) before emitting it.
Useful? React with 👍 / 👎.
| ? tool.input?.query as string ?? "" | ||
| : tool.toolName === "python" | ||
| ? (tool.input?.code as string ?? "").split("\n")[0].slice(0, 60) |
There was a problem hiding this comment.
Validate tool input types before preview truncation
The preview subtitle assumes tool.input.query/code/command are strings and immediately uses string operations (.split/.slice) after TS-only casts. Tool arguments come from model output and are not guaranteed to be strings; if a malformed tool call sends an object/number here, rendering throws (or React receives a non-renderable object), which can break the entire assistant message UI. Guard these fields with typeof value === "string" and fallback text before truncation.
Useful? React with 👍 / 👎.
Add three user-configurable tool call settings to the Studio Settings panel: - Auto Heal Tool Calls: toggle to control fallback XML parsing of malformed tool calls from model output (default: on) - Max Tool Calls Per Message: slider 0-40 + Max to cap tool call iterations per message (default: 10) - Max Tool Call Duration: slider 1-30 minutes + Max to set per-tool-call execution timeout (default: 5 minutes) All settings persist to localStorage and flow through the full stack: frontend store -> API request -> Pydantic model -> route -> llama_cpp -> tools.
for more information, see https://pre-commit.ci
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces significant improvements to tool call handling and execution, including a refactored tool call parsing logic in the backend to better handle malformed model outputs, new parameters for auto_heal_tool_calls, max_tool_iterations, and tool_call_timeout, and the implementation of a persistent working directory for tool execution. The frontend now includes a new HtmlPreview component for rendering HTML code fences in an iframe, a copy button for reasoning text, and a ToolOutputsGroup component to display detailed tool execution inputs and outputs. However, the HtmlPreview component has a high-severity Cross-Site Scripting (XSS) vulnerability due to direct rendering of unsanitized HTML from the model, and a medium-severity information leakage vulnerability in its resizeScript due to insecure postMessage usage with a wildcard origin.
| function HtmlPreview({ source }: { source: string }) { | ||
| const iframeRef = useRef<HTMLIFrameElement>(null); | ||
| const [height, setHeight] = useState(HTML_PREVIEW_DEFAULT_HEIGHT); | ||
| const [enlarged, setEnlarged] = useState(false); | ||
|
|
||
| useEffect(() => { | ||
| const handler = (e: MessageEvent) => { | ||
| if (e.source !== iframeRef.current?.contentWindow) return; | ||
| if (typeof e.data?.htmlPreviewHeight === "number") { | ||
| setHeight(Math.min(Math.max(e.data.htmlPreviewHeight, 100), HTML_PREVIEW_MAX_HEIGHT)); | ||
| } | ||
| }; | ||
| window.addEventListener("message", handler); | ||
| return () => window.removeEventListener("message", handler); | ||
| }, []); | ||
|
|
||
| useEffect(() => { | ||
| if (!enlarged) return; | ||
| const handler = (e: KeyboardEvent) => { | ||
| if (e.key === "Escape") setEnlarged(false); | ||
| }; | ||
| window.addEventListener("keydown", handler); | ||
| return () => window.removeEventListener("keydown", handler); | ||
| }, [enlarged]); | ||
|
|
||
| const resizeScript = `<script>new ResizeObserver(()=>{ | ||
| parent.postMessage({htmlPreviewHeight:document.documentElement.scrollHeight},"*"); | ||
| }).observe(document.documentElement);</script>`; | ||
|
|
||
| const srcDoc = source + resizeScript; | ||
|
|
||
| if (enlarged) { | ||
| return ( | ||
| <> | ||
| <div className="mt-2 overflow-hidden rounded-lg border border-border" style={{ height }}> | ||
| {/* Placeholder keeps layout stable while overlay is shown */} | ||
| </div> | ||
| <div | ||
| className="fixed inset-0 z-50 flex flex-col bg-background/80 backdrop-blur-sm" | ||
| onClick={(e) => { if (e.target === e.currentTarget) setEnlarged(false); }} | ||
| > | ||
| <div className="flex items-center justify-end gap-2 px-4 py-2"> | ||
| <button | ||
| type="button" | ||
| className="flex items-center gap-1.5 rounded-md border border-border bg-background px-3 py-1.5 text-sm text-muted-foreground transition-colors hover:bg-muted hover:text-foreground" | ||
| onClick={() => setEnlarged(false)} | ||
| title="Exit fullscreen (Esc)" | ||
| > | ||
| <Minimize2Icon className="size-4" /> | ||
| Exit fullscreen | ||
| </button> | ||
| </div> | ||
| <div className="mx-4 mb-4 flex-1 overflow-hidden rounded-lg border border-border bg-background"> | ||
| <iframe | ||
| ref={iframeRef} | ||
| srcDoc={srcDoc} | ||
| sandbox="allow-scripts" | ||
| style={{ width: "100%", height: "100%", border: "none", display: "block" }} | ||
| title="HTML preview" | ||
| /> | ||
| </div> | ||
| </div> | ||
| </> | ||
| ); | ||
| } | ||
|
|
||
| return ( | ||
| <div className="group/html-preview relative mt-2 overflow-hidden rounded-lg border border-border"> | ||
| <button | ||
| type="button" | ||
| className="absolute top-2 right-2 z-10 rounded-md border border-border bg-background/80 p-1.5 text-muted-foreground opacity-0 transition-all hover:bg-muted hover:text-foreground group-hover/html-preview:opacity-100 supports-[backdrop-filter]:backdrop-blur" | ||
| onClick={() => setEnlarged(true)} | ||
| title="Enlarge preview" | ||
| > | ||
| <Maximize2Icon className="size-4" /> | ||
| </button> | ||
| <iframe | ||
| ref={iframeRef} | ||
| srcDoc={srcDoc} | ||
| sandbox="allow-scripts" | ||
| style={{ width: "100%", height, border: "none", display: "block" }} | ||
| title="HTML preview" | ||
| /> | ||
| </div> | ||
| ); | ||
| } |
There was a problem hiding this comment.
The HtmlPreview component renders HTML content from the model directly into an iframe with sandbox="allow-scripts". This creates a potential Cross-Site Scripting (XSS) vulnerability. While the sandbox attribute without allow-same-origin provides some protection, a malicious model could still craft HTML with scripts that perform undesirable actions (e.g., redirects, phishing). Similar to how SVG content is sanitized with sanitizeSvg, you should implement HTML sanitization on the source before rendering it in the iframe to mitigate this risk.
| }, [enlarged]); | ||
|
|
||
| const resizeScript = `<script>new ResizeObserver(()=>{ | ||
| parent.postMessage({htmlPreviewHeight:document.documentElement.scrollHeight},"*"); |
There was a problem hiding this comment.
The resizeScript uses parent.postMessage(..., "*"), which is insecure as it allows any website to receive the posted message if this content is embedded. This could lead to information leakage. It's recommended to specify a specific target origin instead of the wildcard *. You can achieve this by passing window.origin from the parent component into the script string.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: bfd5da1185
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| name: str, arguments: dict, cancel_event = None, timeout: int | None = None | ||
| ) -> str: | ||
| """Execute a tool by name with the given arguments. Returns result as a string.""" | ||
| effective_timeout = timeout if timeout is not None else _EXEC_TIMEOUT |
There was a problem hiding this comment.
Preserve the no-timeout sentinel for long-running tools
When a caller selects the new “Max Tool Call Duration” option, generate_chat_completion_with_tools() passes None to mean “no limit”, but execute_tool() immediately rewrites None back to _EXEC_TIMEOUT. As a result, Python/terminal jobs that legitimately run longer than 5 minutes are still killed even though the request explicitly asked for an unlimited timeout.
Useful? React with 👍 / 👎.
| if name == "web_search": | ||
| return _web_search(arguments.get("query", "")) |
There was a problem hiding this comment.
Apply the configured tool timeout to web searches too
The new tool_call_timeout setting is described and surfaced as a per-tool-call limit, but the web_search branch ignores effective_timeout entirely. If DDGS stalls or a user lowers the timeout to keep the agent responsive, web searches can still block the whole tool loop far past the configured limit while the other tools are capped.
Useful? React with 👍 / 👎.
- Use a sentinel to distinguish timeout=None (no limit) from the default (300s). Previously None was silently replaced with _EXEC_TIMEOUT. - Pass the configured timeout to DDGS() for web searches so the setting applies uniformly to all tool types.
for more information, see https://pre-commit.ci
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces significant improvements to tool call parsing, tool execution management, and user interface feedback. The backend now features a more robust _parse_tool_calls_from_text function that correctly handles complex JSON arguments and various XML-style tool call formats, including optional closing tags. A persistent ~/studio_sandbox/ working directory ensures that files created by one tool call are accessible to subsequent calls, which is a crucial enhancement for multi-step tool execution. Frontend improvements include a new collapsible 'Tool Outputs' panel that displays detailed inputs and outputs for each tool call, an elapsed timer for tool status, and a copy button for reasoning blocks. Additionally, the markdown renderer now supports interactive HTML previews for code fences. New parameters for auto-healing tool calls, maximum tool iterations, and tool call timeouts have been integrated across the stack, providing greater control and flexibility. The changes are well-implemented, enhance both functionality and user experience, and demonstrate a thorough approach to improving the agentic capabilities of the system.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 82ad6666e2
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| stdout = subprocess.PIPE, | ||
| stderr = subprocess.STDOUT, | ||
| text = True, | ||
| cwd = tempfile.gettempdir(), | ||
| cwd = _get_workdir(), |
There was a problem hiding this comment.
Place Python temp scripts in the sandbox workdir
The Python tool now launches with cwd = _get_workdir(), but the script is still created via tempfile.mkstemp(...) in the system temp directory. When Python runs a file, sys.path[0] points to the script directory (e.g. /tmp), not the working directory, so modules/files created in ~/studio_sandbox by earlier tool steps are no longer importable with import .... This breaks multi-step workflows that persist code between tool calls unless the temp script is written in the sandbox workdir (or PYTHONPATH is adjusted).
Useful? React with 👍 / 👎.
- Add ge=0 constraint to max_tool_calls_per_message (rejects negative values)
- Add ge=1 constraint to tool_call_timeout (minimum 1 second)
- Thread session_id from frontend through backend to tool execution
- Scope sandbox directories per conversation: ~/studio_sandbox/{thread_id}/
- Backwards compatible: API callers without session_id use ~/studio_sandbox/
for more information, see https://pre-commit.ci
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces significant enhancements to tool call parsing, UI feedback for tool execution, and persistent working directories for tools. The backend now features more robust parsing for both JSON and XML-style tool calls, effectively addressing issues with nested structures and optional closing tags. On the frontend, the addition of a dedicated 'Tool Outputs' panel, an elapsed time display for tool status, and copy buttons for reasoning blocks greatly improve the user experience and observability of tool interactions. Furthermore, the introduction of configurable settings for tool call healing, maximum iterations, and timeout, alongside session-specific working directories, adds valuable flexibility and control to the system.
| for m in re.finditer(r"<tool_call>\s*\{", content): | ||
| brace_start = m.end() - 1 # position of the opening { | ||
| depth, i = 0, brace_start | ||
| in_string = False | ||
| while i < len(content): | ||
| ch = content[i] | ||
| if in_string: | ||
| if ch == "\\" and i + 1 < len(content): | ||
| i += 2 # skip escaped character | ||
| continue | ||
| if ch == '"': | ||
| in_string = False | ||
| elif ch == '"': | ||
| in_string = True | ||
| elif ch == "{": | ||
| depth += 1 | ||
| elif ch == "}": | ||
| depth -= 1 | ||
| if depth == 0: | ||
| break | ||
| i += 1 |
There was a problem hiding this comment.
The manual character-by-character iteration for balanced-brace JSON extraction is robust but makes the _parse_tool_calls_from_text function quite long and complex. Consider extracting this logic into a dedicated helper function to improve readability and maintainability. This would make the main function's flow easier to follow.
| next_func = ( | ||
| func_starts[idx + 1].start() | ||
| if idx + 1 < len(func_starts) | ||
| else len(content) | ||
| ) | ||
| end_tag = re.search(r"</tool_call>", content[body_start:]) | ||
| if end_tag: | ||
| body_end = body_start + end_tag.start() | ||
| else: | ||
| body_end = len(content) | ||
| body_end = min(body_end, next_func) |
There was a problem hiding this comment.
The logic for determining the body_end in the XML parsing (next_func, end_tag, min) is quite intricate. While it appears correct, it could benefit from additional comments explaining the priority or flow of these boundary conditions, or perhaps a small helper function to encapsulate this boundary calculation.
|
|
||
| # Per-session working directories so each chat thread gets its own sandbox. | ||
| # Falls back to a shared ~/studio_sandbox/ for API callers without a session_id. | ||
| _workdirs: dict[str, str] = {} |
There was a problem hiding this comment.
The _workdirs global dictionary is used to manage session-specific working directories. While session_id provides isolation, global mutable state can introduce complexities in managing the lifecycle of these directories. Consider adding a mechanism to clean up old entries from _workdirs when sessions are no longer active to prevent potential memory growth or stale directories.
| <iframe | ||
| ref={iframeRef} | ||
| srcDoc={srcDoc} | ||
| sandbox="allow-scripts" |
There was a problem hiding this comment.
The sandbox="allow-scripts" attribute is a good security measure for HTML previews. However, allow-scripts is a relatively permissive flag. If the HTML content is not fully trusted and does not strictly require JavaScript execution for its intended display, consider using a more restrictive set of sandbox flags (e.g., allow-same-origin if needed for styling, but without allow-scripts) to further reduce the attack surface.
| const MAX_DISPLAY_LENGTH = 10000; | ||
| const COPY_RESET_MS = 2000; | ||
|
|
||
| function truncate(text: string): string { |
There was a problem hiding this comment.
The MAX_DISPLAY_LENGTH for truncating tool output is set to 10,000 characters. While this prevents UI breakage from excessively long outputs, it might still be too long for a default display, especially on smaller screens. Consider a shorter default truncation (e.g., 500-1000 characters) with an explicit 'show more' or 'expand' option to improve readability and user experience.
| input: toolEvent.arguments as Record<string, unknown>, | ||
| }); | ||
| } else if (toolEvent.type === "tool_end") { | ||
| const key = (toolEvent.tool_call_id as string) || [...pendingTools.keys()].pop() || ""; |
There was a problem hiding this comment.
The fallback logic [...pendingTools.keys()].pop() || "" for tool_end events to find a tool_call_id is a bit indirect. If tool_call_id is expected to be consistently present in tool_end events when it was present in tool_start, this fallback might indicate an unexpected data flow. If tool_call_id can genuinely be missing for tool_end but a match is still desired, consider adding a comment explaining this specific edge case and why pop() is the chosen fallback.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 278d9482b2
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| toolOutputs.push({ | ||
| toolName: pending.toolName ?? (toolEvent.tool_name as string), | ||
| toolCallId: key, | ||
| input: pending.input ?? {}, | ||
| result: toolEvent.result as string, |
There was a problem hiding this comment.
Avoid streaming raw tool stdout/stderr into message metadata
toolOutputs.push(...) now copies every tool_end payload into assistant-message metadata, which the new UI renders and persists. Because the Python and terminal subprocesses are started without an env= override in studio/backend/core/inference/tools.py (see the subprocess.Popen(...) calls there), anything they print from host env vars or local files is now sent straight to the browser and saved client-side instead of staying inside the model/tool loop. In deployments where code tools can read sensitive host data, this is a real data-exposure regression even when the assistant's final answer would not have surfaced that output.
Useful? React with 👍 / 👎.
| if session_id: | ||
| workdir = os.path.join(home, "studio_sandbox", session_id) | ||
| else: | ||
| workdir = os.path.join(home, "studio_sandbox") | ||
| os.makedirs(workdir, exist_ok = True) |
There was a problem hiding this comment.
Clean up per-session sandbox directories after thread teardown
_get_workdir() now creates a persistent ~/studio_sandbox/<session_id> directory for every thread and caches it, but a repo-wide search for studio_sandbox / _workdirs only finds this creation path—there is no deletion when a chat is removed or expires. Since the Python and terminal tools can write arbitrary files there, long-lived Studio deployments will accumulate one directory per thread plus all generated artifacts until disk space or inodes are exhausted.
Useful? React with 👍 / 👎.
- Split tool markup stripping into closed-only (mid-stream) and full (final flush) to prevent cumulative text from shrinking mid-stream - Enforce monotonicity: only emit when cleaned text grows, so the proxy's delta logic (cumulative[len(prev_text):]) never breaks - Place Python temp scripts in the sandbox workdir instead of /tmp so sys.path[0] points to the sandbox and cross-call imports work
for more information, see https://pre-commit.ci
|
/gemini review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ad4aa5c564
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if session_id: | ||
| workdir = os.path.join(home, "studio_sandbox", session_id) |
There was a problem hiding this comment.
Sanitize
session_id before using it as a filesystem path
session_id is now a request field, but _get_workdir() feeds it straight into os.path.join(home, "studio_sandbox", session_id). An API caller can pass values like "../../outside" or an absolute path such as "/tmp/foo", and both the Python and terminal tools will then create/run files in that resolved directory instead of staying under ~/studio_sandbox. That turns the new per-session sandboxing into a path-traversal escape for any deployment that exposes tool calls to untrusted clients.
Useful? React with 👍 / 👎.
| max_tool_iterations = payload.max_tool_calls_per_message | ||
| if payload.max_tool_calls_per_message is not None |
There was a problem hiding this comment.
Enforce the new tool-call limit on actual calls
This request setting is surfaced as max_tool_calls_per_message, but in the GGUF path it is forwarded directly to max_tool_iterations. In generate_chat_completion_with_tools(), each iteration still executes every entry in tool_calls, so a model response containing multiple tool calls can exceed the configured cap immediately (for example, setting the slider to 1 still runs all calls from the first assistant turn). Users relying on this new control to bound latency or tool usage will not get the limit they asked for.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Code Review
The pull request introduces several enhancements to the tool execution functionality. It includes changes to the pyproject.toml file to update the homepage and documentation URLs. The llama_cpp.py file was modified to improve the parsing of tool calls from text, handling cases where closing tags are omitted. The generate_chat_completion_with_tools function in llama_cpp.py was updated to include auto-healing of tool calls, increase the maximum tool iterations, and add a timeout for tool calls. The tools.py file was modified to include per-session working directories for tool execution, add a timeout parameter to the execute_tool function, and update the timeout message to include the actual timeout value. The ChatCompletionRequest model in models/inference.py was updated to include fields for auto-healing tool calls, maximum tool calls per message, tool call timeout, and session ID. The gguf_generate_with_tools function in routes/inference.py was updated to include parameters for auto-healing tool calls, maximum tool iterations, tool call timeout, and session ID. The markdown-text.tsx file was modified to add support for HTML previews. The reasoning.tsx file was modified to add a copy button to the reasoning text. The thread.tsx file was modified to add a tool outputs group. The tool-outputs.tsx file was created to display the tool outputs. The chat-adapter.ts file was modified to include a ToolOutput interface and to accumulate tool start/end events for the tool outputs panel. The chat-api.ts file was modified to yield tool start/end events. The chat-settings-sheet.tsx file was modified to add auto-heal tool calls, max tool calls slider, and tool call timeout slider. The chat-runtime-store.ts file was modified to add auto-heal tool calls, max tool calls per message, and tool call timeout. The api.ts file was modified to include parameters for auto-healing tool calls, maximum tool calls per message, tool call timeout, and session ID. The regular expression in llama_cpp.py for stripping tool call markup is too greedy and may lead to incorrect stripping of content. The docstring for the execute_tool function in tools.py should be updated to reflect the new timeout and session_id parameters. The TemporaryDirectory context manager is no longer used in tools.py, so the try...finally block for cleaning up tmp_path is unnecessary. The temporary directory is no longer used in tools.py, so this entire try block can be simplified. The timeout message in tools.py should include the actual timeout value. The block of if statements in routes/inference.py can be simplified by using the or operator to provide default values.
I am having trouble creating individual review comments. Click here to see my feedback.
studio/backend/core/inference/llama_cpp.py (1688-1693)
The regular expression r"<tool_call>.*$" will greedily match everything from the first <tool_call> tag to the end of the content, even if there are multiple <tool_call> tags. This could lead to incorrect stripping of content if the model generates multiple tool calls but the first one is unterminated. Consider using a non-greedy match or a more specific pattern to avoid this.
r"<tool_call>.*?$",
"",
content_text,
flags = re.DOTALL,studio/backend/core/inference/tools.py (83-84)
The docstring should be updated to reflect the new timeout and session_id parameters.
def execute_tool(name: str, arguments: dict, cancel_event = None,
timeout: int | None = _TIMEOUT_UNSET,
session_id: str | None = None) -> str:
"""Execute a tool by name with the given arguments. Returns result as a string.
``timeout``: int sets per-call limit in seconds, ``None`` means no limit,
unset (default) uses ``_EXEC_TIMEOUT`` (300 s).
``session_id``: optional thread/session ID for per-conversation sandbox isolation.
"""
studio/backend/core/inference/tools.py (215-249)
The TemporaryDirectory context manager is no longer used, so the try...finally block for cleaning up tmp_path is unnecessary. The temporary file is now created in a persistent working directory, so it should not be deleted after each execution.
tmp_path = None
workdir = _get_workdir(session_id)
try:
fd, tmp_path = tempfile.mkstemp(
suffix = ".py", prefix = "studio_exec_", dir = workdir
)
with os.fdopen(fd, "w") as f:
f.write(code)
proc = subprocess.Popen(
[sys.executable, tmp_path],
stdout = subprocess.PIPE,
stderr = subprocess.STDOUT,
text = True,
cwd = workdir,
)
# Spawn cancel watcher if we have a cancel event
if cancel_event is not None:
watcher = threading.Thread(
target = _cancel_watcher, args = (proc, cancel_event), daemon = True
)
watcher.start()
try:
output, _ = proc.communicate(timeout = timeout)
except subprocess.TimeoutExpired:
proc.kill()
proc.communicate()
return _truncate(f"Execution timed out after {timeout} seconds.")
if cancel_event is not None and cancel_event.is_set():
return "Execution cancelled."
result = output or ""
if proc.returncode != 0:
result = f"Exit code {proc.returncode}:\n{result}"
return _truncate(result) if result.strip() else "(no output)"
except Exception as e:
return f"Execution error: {e}"studio/backend/core/inference/tools.py (217-224)
The temporary directory is no longer used, so this entire try block can be simplified.
try:
proc = subprocess.Popen(
["bash", "-c", command],
stdout = subprocess.PIPE,
stderr = subprocess.STDOUT,
text = True,
cwd = tmpdir,
)studio/backend/core/inference/tools.py (234-238)
The timeout message should include the actual timeout value.
proc.kill()
proc.communicate()
return _truncate(f"Execution timed out after {timeout} seconds.")studio/backend/core/inference/tools.py (238)
The timeout message should include the actual timeout value.
proc.kill()
proc.communicate()
return _truncate(f"Execution timed out after {timeout} seconds.")
studio/backend/core/inference/tools.py (291-295)
The timeout message should include the actual timeout value.
proc.kill()
proc.communicate()
return _truncate(f"Execution timed out after {timeout} seconds.")
studio/backend/routes/inference.py (1053-1062)
This block of if statements can be simplified by using the or operator to provide default values.
enable_thinking = payload.enable_thinking,
auto_heal_tool_calls = payload.auto_heal_tool_calls if payload.auto_heal_tool_calls is not None else True,
max_tool_iterations = payload.max_tool_calls_per_message if payload.max_tool_calls_per_message is not None else 10,
tool_call_timeout = payload.tool_call_timeout if payload.tool_call_timeout is not None else 300,
session_id = payload.session_id,Strip path separators and parent-dir references from session_id before using it as a directory name. Verify the resolved path stays under ~/studio_sandbox/ as a second guard.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 00de9c306e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| tool_call_timeout = payload.tool_call_timeout | ||
| if payload.tool_call_timeout is not None | ||
| else 300, | ||
| session_id = payload.session_id, |
There was a problem hiding this comment.
Bind sandbox session IDs to server-owned conversation IDs
session_id from the request body is forwarded directly into tool execution (generate_chat_completion_with_tools(..., session_id=payload.session_id)), and the tool layer uses that value to select the persistent sandbox directory. Because there is no server-side ownership check tying this ID to the authenticated caller/thread, a client can reuse another conversation’s ID and read or overwrite its persisted tool artifacts, which breaks isolation in multi-user deployments.
Useful? React with 👍 / 👎.
| if in_thinking: | ||
| if has_content_tokens: | ||
| cumulative += "</think>" | ||
| yield {"type": "content", "text": cumulative} | ||
| yield { | ||
| "type": "content", |
There was a problem hiding this comment.
Run final tool-markup cleanup for non-thinking streams
The end-of-stream path only applies _strip_tool_markup(..., final=True) inside if in_thinking; when in_thinking is false, the function returns without a final cleanup pass. Since incremental stripping only removes closed blocks, any unterminated <tool_call> or <function=...> markup that appears at the end of a normal response can leak into user-visible output.
Useful? React with 👍 / 👎.
Replace custom metadata-based ToolOutputsGroup with native assistant-ui
tool-call content parts. Backend SSE tool_start/tool_end events now emit
proper { type: "tool-call" } parts from the adapter, enabling per-tool
UIs registered via tools.by_name in MessagePrimitive.Parts.
- Web search: Globe icon, Source badges with favicons, auto-collapse
when LLM starts responding
- Python: Code icon, syntax-highlighted code via Streamdown/shiki,
output block with copy
- Terminal: Terminal icon, command in trigger, output with copy
- ToolGroup wraps consecutive tool calls (skips for single calls)
- Sources component renders URL badges at end of message
- Flattened code block CSS (single border, no nested boxes)
There was a problem hiding this comment.
💡 Codex Review
unsloth/studio/backend/routes/inference.py
Lines 1031 to 1038 in f3def91
enabled_tools allowlist
When enabled_tools is [], this truthiness check falls back to ALL_TOOLS, so a request that intentionally disables every tool still exposes the full tool set, including Python and terminal execution. The new per-tool allowlist needs to distinguish null/omitted from an explicit empty list; otherwise a client-side config bug or policy rollout can silently re-enable code-execution tools.
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| <img | ||
| data-slot="source-icon" | ||
| src={`https://www.google.com/s2/favicons?domain=${domain}&sz=32`} |
There was a problem hiding this comment.
Avoid leaking cited domains to Google's favicon service
Every source pill now triggers a browser request to www.google.com with the cited domain as soon as the message renders, before the user clicks anything. In self-hosted or privacy-sensitive deployments this leaks the generated source list and the viewer's IP to a third party, and it will also break under offline/CSP-restricted setups. A local fallback icon (or server-side proxy) would avoid introducing that new outbound dependency.
Useful? React with 👍 / 👎.
`if payload.enabled_tools:` is falsy for [], falling through to ALL_TOOLS. Use `is not None` so an explicit empty list disables all tools as intended.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f585caa49b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| const id = (toolEvent.tool_call_id as string) || | ||
| toolCallParts[toolCallParts.length - 1]?.toolCallId || ""; | ||
| const idx = toolCallParts.findIndex((p) => p.toolCallId === id); | ||
| if (idx !== -1) { |
There was a problem hiding this comment.
Match tool_end to the latest duplicate tool call ID
When tool IDs are reused across iterations (the fallback parser in llama_cpp.py generates call_0, call_1, etc. per parse cycle), this lookup updates the first matching entry instead of the currently running one. In multi-step tool runs, a later tool_end can overwrite an earlier call’s result and leave the new call stuck in running state. This breaks tool-output attribution whenever duplicate tool_call_id values appear; the matcher should target the most recent unresolved call (e.g., last matching index) rather than the first match.
Useful? React with 👍 / 👎.
…slothai#4416) * Add elapsed timer to tool status pill in Studio Show a count-up seconds timer (0s, 1s, 2s, ...) next to the tool status text in the composer area. Helps users gauge how long a tool call (web search, code execution) has been running. Timer resets when a new tool starts and disappears when all tools finish. * Fix tool call parsing, add tool outputs panel and reasoning copy button Backend: - Rewrite tool call XML parser to use balanced-brace JSON extraction instead of greedy regex, fixing truncation on nested braces in code/JSON arguments - Handle optional closing tags (</tool_call>, </function>, </parameter>) that models frequently omit - Support bare <function=...> tags without <tool_call> wrapper - Strip tool call markup from streamed content so raw XML never leaks into the chat UI - Use a persistent ~/studio_sandbox/ working directory for tool execution so files persist across calls within a session - Emit tool_start/tool_end SSE events so the frontend can display tool inputs and outputs Frontend: - Add collapsible "Tool Outputs" panel below assistant messages showing each tool call's input and output with copy buttons - Add copy button to reasoning blocks - Add elapsed timer to tool status pill - Update project URLs in pyproject.toml (http -> https, add docs link) * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add interactive HTML preview with fullscreen toggle for code blocks HTML code fences now render an interactive sandboxed iframe preview below the syntax-highlighted code, similar to how SVG fences show an image preview. The iframe uses sandbox="allow-scripts" to allow JavaScript execution while blocking access to the parent page. Includes a fullscreen toggle (enlarge/minimize button) that expands the preview into a viewport overlay, dismissible via button, Escape key, or backdrop click. A streaming placeholder prevents partial HTML from rendering mid-stream. * Add tool call settings: auto-heal toggle, max iterations, timeout Add three user-configurable tool call settings to the Studio Settings panel: - Auto Heal Tool Calls: toggle to control fallback XML parsing of malformed tool calls from model output (default: on) - Max Tool Calls Per Message: slider 0-40 + Max to cap tool call iterations per message (default: 10) - Max Tool Call Duration: slider 1-30 minutes + Max to set per-tool-call execution timeout (default: 5 minutes) All settings persist to localStorage and flow through the full stack: frontend store -> API request -> Pydantic model -> route -> llama_cpp -> tools. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix tool call timeout: respect no-limit and apply to web search - Use a sentinel to distinguish timeout=None (no limit) from the default (300s). Previously None was silently replaced with _EXEC_TIMEOUT. - Pass the configured timeout to DDGS() for web searches so the setting applies uniformly to all tool types. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add input validation bounds and per-thread sandbox isolation - Add ge=0 constraint to max_tool_calls_per_message (rejects negative values) - Add ge=1 constraint to tool_call_timeout (minimum 1 second) - Thread session_id from frontend through backend to tool execution - Scope sandbox directories per conversation: ~/studio_sandbox/{thread_id}/ - Backwards compatible: API callers without session_id use ~/studio_sandbox/ * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix non-monotonic streaming and Python temp script path - Split tool markup stripping into closed-only (mid-stream) and full (final flush) to prevent cumulative text from shrinking mid-stream - Enforce monotonicity: only emit when cleaned text grows, so the proxy's delta logic (cumulative[len(prev_text):]) never breaks - Place Python temp scripts in the sandbox workdir instead of /tmp so sys.path[0] points to the sandbox and cross-call imports work * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Sanitize session_id to prevent path traversal in sandbox Strip path separators and parent-dir references from session_id before using it as a directory name. Verify the resolved path stays under ~/studio_sandbox/ as a second guard. * feat(chat): proper assistant-ui tool call UIs with sources Replace custom metadata-based ToolOutputsGroup with native assistant-ui tool-call content parts. Backend SSE tool_start/tool_end events now emit proper { type: "tool-call" } parts from the adapter, enabling per-tool UIs registered via tools.by_name in MessagePrimitive.Parts. - Web search: Globe icon, Source badges with favicons, auto-collapse when LLM starts responding - Python: Code icon, syntax-highlighted code via Streamdown/shiki, output block with copy - Terminal: Terminal icon, command in trigger, output with copy - ToolGroup wraps consecutive tool calls (skips for single calls) - Sources component renders URL badges at end of message - Flattened code block CSS (single border, no nested boxes) * fix(inference): respect empty enabled_tools allowlist `if payload.enabled_tools:` is falsy for [], falling through to ALL_TOOLS. Use `is not None` so an explicit empty list disables all tools as intended. --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Shine1i <wasimysdev@gmail.com>
Why the existing sandbox-hardening code is edited (FA3 rationale): - `_sandbox_preexec` (tools.py:241) and its `_libc.prctl(PR_SET_NO_NEW_PRIVS)` block (tools.py:257-260) were introduced by unslothai#4827 / unslothai#5375 to harden tool subprocesses. That contract is preserved: the original function is renamed to `_sandbox_preexec_impl(apply_no_new_privs)` and a same-named `_sandbox_preexec()` wrapper calls it with `apply_no_new_privs=True`, so every direct caller still sets NNP. The new `_sandbox_preexec_for_bwrap()` skips NNP only on the Linux+bwrap path where bwrap itself reapplies NNP to the inner payload (setuid-helper bwrap installs cannot acquire helper privileges if NNP is set before execve). Net result: the inner sandboxed program still runs with NNP=1. - The `if sys.platform != "win32": popen_kwargs["preexec_fn"] = ...` / `else: popen_kwargs["creationflags"] = subprocess.CREATE_NO_WINDOW` blocks in `_python_exec` (tools.py:1615-1618) and `_bash_exec` (tools.py:1709-1712) come from unslothai#4827. They are reorganised, not removed: the non-Windows path still sets `preexec_fn` and the Windows path still sets `CREATE_NO_WINDOW`. The only behavioural delta is the bwrap branch (Linux + sandbox_available()) selecting `_sandbox_preexec_for_bwrap` per above. - `_workdirs[key] = workdir` (tools.py:348) from unslothai#4416 is replaced with `_workdirs[key] = os.path.realpath(workdir)`. Same caching behaviour; the realpath normalization is required because the bwrap argv binds `os.path.realpath(workdir)` while `tempfile.mkstemp(dir=workdir)` builds tmp_path from the non-realpath value, so on hosts where HOME or a parent is a symlink the script path passed via inner_argv is not reachable inside the sandbox namespace. studio/backend/core/inference/sandbox.py - Narrow Linux /etc bind from the whole directory to a runtime allowlist (hosts, resolv.conf, nsswitch.conf, localtime, ld.so.cache, ld.so.conf and ld.so.conf.d, ssl, ca-certificates, pki) so sandboxed tools cannot read sshd_config, machine-id, cloud-init data, etc. - Narrow Seatbelt /private/etc the same way; replace the broad subpath with literal entries for hosts, resolv.conf, nsswitch.conf, localtime, protocols, services and subpaths for ssl and ca-certificates. - Constrain Seatbelt allow process-exec to the same subpaths already used in file-map-executable so sandboxed Python cannot exec arbitrary system binaries such as osascript. - Remove allow file-read and file-write on literal /dev/tty from the Seatbelt profile; tool output is captured via PIPE and there is no legitimate path that needs to bypass it. - Gate site.getusersitepackages() behind UNSLOTH_STUDIO_SANDBOX_ALLOW_USER_SITE=1 so the sandbox no longer binds the real ~/.local/lib subtree by default (HOME is already rewritten to workdir for the child). - Resolve _BWRAP_PROBE_BIN via shutil.which at module load with the /usr/bin/true fallback, so the probe works on Alpine and BusyBox layouts that place coreutils under /bin instead of /usr/bin. - Replace the unreachable _linux_bwrap_path-or-bwrap fallback with an assertion that names the sandbox_available contract; same for the unreachable RuntimeError in build_sandbox_argv. studio/backend/run.py - Run sandbox_available on a background thread with a 2s bounded wait. On hardened kernels (unprivileged_userns_clone=0) the probe hangs for its full 5s timeout; running it inline delayed the access banner. The probe still runs (and caches) eagerly; if it has not reported back within 2s the banner is printed first and the unavailable notice is suppressed for this call.
Summary
<function=...>tags without<tool_call>wrapper, and strip all tool markup from streamed content so raw XML never leaks into the chat UI.~/studio_sandbox/working directory for Python and Bash tool execution so files written by one tool call are visible to the next.tool_start/tool_endSSE events from the backend and surface them in a new collapsible "Tool Outputs" panel below assistant messages, showing each tool call's input and output with copy buttons.Changes
Backend
studio/backend/core/inference/llama_cpp.py-- Rewrite_parse_tool_calls_from_textto use balanced-brace JSON extraction. Handle bare<function=...>tags. Emittool_start/tool_endevents. Strip tool markup from streamed content tokens.studio/backend/core/inference/tools.py-- Replace per-callTemporaryDirectorywith a persistent~/studio_sandbox/working directory shared across tool calls.studio/backend/routes/inference.py-- Forwardtool_start/tool_endevents through SSE stream.Frontend
studio/frontend/src/components/assistant-ui/tool-outputs.tsx-- New collapsible panel component showing tool inputs/outputs with copy buttons.studio/frontend/src/components/assistant-ui/reasoning.tsx-- Add copy button to reasoning blocks.studio/frontend/src/components/assistant-ui/thread.tsx-- Wire upToolOutputsGroup, add elapsed timer to tool status pill.studio/frontend/src/features/chat/api/chat-adapter.ts-- DefineToolOutputinterface, accumulate tool events during streaming, attach to message metadata.studio/frontend/src/features/chat/api/chat-api.ts-- Parse and forwardtool_start/tool_endSSE events.pyproject.toml-- Fix project URLs.Test plan
npm run buildpasses with no TypeScript errors