Skip to content

Fix tool call parsing, add tool outputs panel and UI improvements#4416

Merged
danielhanchen merged 23 commits into
mainfrom
studio-tool-elapsed-timer
Mar 18, 2026
Merged

Fix tool call parsing, add tool outputs panel and UI improvements#4416
danielhanchen merged 23 commits into
mainfrom
studio-tool-elapsed-timer

Conversation

@danielhanchen
Copy link
Copy Markdown
Member

@danielhanchen danielhanchen commented Mar 18, 2026

Summary

  • Fix tool call XML parsing in llama.cpp backend -- the old greedy regex would truncate JSON arguments containing nested braces or code strings. Replaced with balanced-brace extraction and proper handling of optional closing tags that models frequently omit.
  • Support bare <function=...> tags without <tool_call> wrapper, and strip all tool markup from streamed content so raw XML never leaks into the chat UI.
  • Use a persistent ~/studio_sandbox/ working directory for Python and Bash tool execution so files written by one tool call are visible to the next.
  • Emit tool_start/tool_end SSE events from the backend and surface them in a new collapsible "Tool Outputs" panel below assistant messages, showing each tool call's input and output with copy buttons.
  • Add a copy button to reasoning/thinking blocks.
  • Add an elapsed seconds timer to the tool status pill so users can see how long a tool has been running.
  • Update project URLs in pyproject.toml (http to https, add docs link).

Changes

Backend

  • studio/backend/core/inference/llama_cpp.py -- Rewrite _parse_tool_calls_from_text to use balanced-brace JSON extraction. Handle bare <function=...> tags. Emit tool_start/tool_end events. Strip tool markup from streamed content tokens.
  • studio/backend/core/inference/tools.py -- Replace per-call TemporaryDirectory with a persistent ~/studio_sandbox/ working directory shared across tool calls.
  • studio/backend/routes/inference.py -- Forward tool_start/tool_end events through SSE stream.

Frontend

  • studio/frontend/src/components/assistant-ui/tool-outputs.tsx -- New collapsible panel component showing tool inputs/outputs with copy buttons.
  • studio/frontend/src/components/assistant-ui/reasoning.tsx -- Add copy button to reasoning blocks.
  • studio/frontend/src/components/assistant-ui/thread.tsx -- Wire up ToolOutputsGroup, add elapsed timer to tool status pill.
  • studio/frontend/src/features/chat/api/chat-adapter.ts -- Define ToolOutput interface, accumulate tool events during streaming, attach to message metadata.
  • studio/frontend/src/features/chat/api/chat-api.ts -- Parse and forward tool_start/tool_end SSE events.
  • pyproject.toml -- Fix project URLs.

Test plan

  • Load a GGUF model with tool support
  • Trigger web search and code execution tools
  • Verify tool call parsing works for JSON-style and XML-style tool calls, including those with missing closing tags
  • Verify tool markup does not appear in chat messages
  • Verify "Tool Outputs" collapsible panel appears after tool calls complete, with correct input/output
  • Verify copy buttons work on tool outputs and reasoning blocks
  • Verify elapsed timer counts up on the status pill and resets between tools
  • Verify files persist across Python/Bash tool calls within a session
  • npm run build passes with no TypeScript errors

Show a count-up seconds timer (0s, 1s, 2s, ...) next to the tool status
text in the composer area. Helps users gauge how long a tool call (web
search, code execution) has been running. Timer resets when a new tool
starts and disappears when all tools finish.
@gemini-code-assist
Copy link
Copy Markdown
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly improves the user experience in Studio by providing immediate visual feedback on the duration of tool executions. By displaying an elapsed timer next to the tool status, users can now quickly gauge how long operations like web searches or code executions have been running, enhancing transparency and reducing perceived latency during interactive sessions.

Highlights

  • Real-time Tool Execution Timer: Introduced a count-up seconds timer that appears next to the tool status text within the pulsing pill in the Studio composer area.
  • Timer Behavior: The timer initiates at 0s when a tool (e.g., web search, code execution) starts, increments every second, resets to 0s for subsequent tools, and disappears when all tools complete.
  • Implementation Details: The timer functionality was added to the ToolStatusDisplay component using React's useEffect and useState hooks, and styled with tabular-nums for consistent digit width and opacity-60 for visual subtlety.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces an elapsed timer to the tool status display, enhancing user feedback during tool operations. The implementation is straightforward, utilizing a useEffect hook. I've provided one suggestion to refactor the useEffect logic for improved clarity and to remove redundancy. The rest of the changes are solid.

Comment on lines +390 to +398
if (!toolStatus) {
setElapsed(0);
return;
}
setElapsed(0);
const interval = setInterval(() => {
setElapsed((prev) => prev + 1);
}, 1000);
return () => clearInterval(interval);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The logic inside this useEffect can be simplified to be more declarative. By first resetting the elapsed time, and then conditionally starting the interval only when toolStatus is truthy, you can avoid redundant state updates and make the code's intent clearer.

    setElapsed(0);
    if (toolStatus) {
      const interval = setInterval(() => {
        setElapsed((prev) => prev + 1);
      }, 1000);
      return () => clearInterval(interval);
    }

Backend:
- Rewrite tool call XML parser to use balanced-brace JSON extraction
  instead of greedy regex, fixing truncation on nested braces in
  code/JSON arguments
- Handle optional closing tags (</tool_call>, </function>, </parameter>)
  that models frequently omit
- Support bare <function=...> tags without <tool_call> wrapper
- Strip tool call markup from streamed content so raw XML never leaks
  into the chat UI
- Use a persistent ~/studio_sandbox/ working directory for tool
  execution so files persist across calls within a session
- Emit tool_start/tool_end SSE events so the frontend can display
  tool inputs and outputs

Frontend:
- Add collapsible "Tool Outputs" panel below assistant messages showing
  each tool call's input and output with copy buttons
- Add copy button to reasoning blocks
- Add elapsed timer to tool status pill
- Update project URLs in pyproject.toml (http -> https, add docs link)
@danielhanchen danielhanchen changed the title Add elapsed timer to tool status pill in Studio Fix tool call parsing, add tool outputs panel and UI improvements Mar 18, 2026
@danielhanchen
Copy link
Copy Markdown
Member Author

/gemini review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d0241036cf

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +1598 to +1600
if not tool_calls and (
"<tool_call>" in content_text or "<function=" in content_text
):
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Restrict fallback parsing to explicit tool-call markup

The fallback now treats any assistant content containing "<function=" as a tool call candidate, even when finish_reason is "stop". In practice, a normal answer that shows XML/code examples like <function=web_search> will be parsed and executed as a real tool invocation, which can trigger unintended web/python/terminal calls and strip user-visible text. This path should be gated to explicit tool-call contexts (e.g., <tool_call> wrapper or structured tool_calls) rather than arbitrary content.

Useful? React with 👍 / 👎.

Comment on lines +1242 to +1243
next_func = func_starts[idx + 1].start() if idx + 1 < len(func_starts) else len(content)
end_tag = re.search(r"</tool_call>", content[body_start:])
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Parse function bodies without using raw <function= sentinels

Using the next raw <function= token as a hard body boundary breaks valid single-call payloads when that literal appears inside parameter text (for example Python code containing "<function=...>"). The current logic truncates the first tool's arguments and can fabricate an extra tool call from the in-band string, leading to corrupted execution and wrong tool invocations. Function boundaries need to be derived from actual markup structure, not unconstrained substring matches.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request significantly enhances tool call handling and user interface for tool outputs. It updates website URLs in pyproject.toml. The backend llama_cpp.py now robustly parses tool calls from text, including complex JSON and XML structures with optional closing tags, and introduces tool_start/tool_end events for better tracking, alongside improved stripping of tool call markup from content. The tools.py file now uses a persistent working directory for Python and Bash executions, allowing files to persist across tool calls. The inference.py route is updated to pass these new tool events to the frontend. On the frontend, new ToolOutputsGroup and ToolItemCard components are introduced to display detailed, collapsible tool inputs and outputs, complete with copy functionality. The reasoning.tsx component gains a copy button for reasoning text, and thread.tsx now shows an elapsed time for tool execution. Review comments suggest improving the workdir creation in tools.py to prevent race conditions, optimizing re.sub usage in llama_cpp.py for efficiency, and using requestAnimationFrame for smoother elapsed time updates in thread.tsx.

Comment thread studio/backend/core/inference/tools.py Outdated
Comment on lines +33 to +37
global _persistent_workdir
if _persistent_workdir is None or not os.path.isdir(_persistent_workdir):
home = os.path.expanduser("~")
workdir = os.path.join(home, "studio_sandbox")
os.makedirs(workdir, exist_ok = True)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Consider adding a check to ensure that the workdir is only created if it doesn't already exist. This can prevent potential race conditions if multiple threads or processes try to create the directory simultaneously.

        if _persistent_workdir is None or not os.path.isdir(_persistent_workdir):
            home = os.path.expanduser("~")
            workdir = os.path.join(home, "studio_sandbox")
            if not os.path.exists(workdir): # Add this check
                os.makedirs(workdir, exist_ok=True)
            _persistent_workdir = workdir

Comment on lines +1617 to +1620
flags = re.DOTALL,
)
# Strip unterminated <tool_call>... to end
content_text = re.sub(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using re.sub with re.DOTALL to remove potentially large blocks of text could be inefficient. Consider a more targeted approach if performance becomes an issue, perhaps by finding the start and end indices and slicing the string.

                    content_text = re.sub(
                        r"<tool_call>.*$",
                        "",
                        content_text,
                        flags = re.DOTALL,
                    ).strip()

Comment on lines +395 to +399
setElapsed(0);
const interval = setInterval(() => {
setElapsed((prev) => prev + 1);
}, 1000);
return () => clearInterval(interval);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Consider using requestAnimationFrame instead of setInterval for updating the elapsed time. This can provide smoother updates and better performance, as it synchronizes with the browser's rendering pipeline.

    useEffect(() => {
        if (!toolStatus) {
            setElapsed(0);
            return;
        }
        setElapsed(0);
        const updateElapsed = () => {
            setElapsed((prev) => prev + 1);
        };
        let animationFrameId: number;
        const tick = () => {
            updateElapsed();
            animationFrameId = requestAnimationFrame(tick);
        };
        animationFrameId = requestAnimationFrame(tick);
        return () => cancelAnimationFrame(animationFrameId);
    }, [toolStatus]);

HTML code fences now render an interactive sandboxed iframe preview
below the syntax-highlighted code, similar to how SVG fences show
an image preview. The iframe uses sandbox="allow-scripts" to allow
JavaScript execution while blocking access to the parent page.

Includes a fullscreen toggle (enlarge/minimize button) that expands
the preview into a viewport overlay, dismissible via button, Escape
key, or backdrop click. A streaming placeholder prevents partial
HTML from rendering mid-stream.
@danielhanchen
Copy link
Copy Markdown
Member Author

/gemini review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9975db655b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

_re_final.compile(r"<tool_call>.*?</tool_call>", _re_final.DOTALL),
_re_final.compile(r"<function=\w+>.*?</function>", _re_final.DOTALL),
_re_final.compile(r"<tool_call>.*$", _re_final.DOTALL),
_re_final.compile(r"<function=\w+>.*$", _re_final.DOTALL),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Limit tool-markup stripping to confirmed tool call blocks

The final streaming cleanup applies r"<function=\w+>.*$" (and the analogous <tool_call> pattern) to every assistant response after tools run, so any legitimate text that contains these literals is truncated from that point onward. In practice, answers that explain XML/examples like <function=foo> lose the remainder of the message. This should only strip markup when the segment is an actual tool-call payload, not any raw substring match.

Useful? React with 👍 / 👎.

Comment on lines +1265 to +1267
val = body[pm.end() :]
val = re.sub(r"\s*</parameter>\s*$", "", val)
arguments[pm.group(1)] = val.strip()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Stop bare-function parameter capture at closing function tag

In the bare <function=...> fallback path, single-parameter parsing takes everything from <parameter=...> to the end of body, and </function> is only removed if it is the final suffix. If the model emits a valid closing </function> followed by normal text, that closing tag and trailing prose are included in the tool argument, which produces malformed query/code/command values and incorrect tool execution.

Useful? React with 👍 / 👎.

Comment thread studio/backend/core/inference/tools.py Outdated
Comment on lines +36 to +37
workdir = os.path.join(home, "studio_sandbox")
os.makedirs(workdir, exist_ok = True)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Isolate tool working directories per conversation

The new _get_workdir() hard-codes a single process-wide ~/studio_sandbox directory, and both Python and Bash tools now execute in that shared cwd. In multi-chat or multi-user deployments this leaks files across requests and creates cross-run filename collisions, so one conversation can read or overwrite another's artifacts. The workdir should be scoped (or cleaned) per session/conversation rather than shared for the server lifetime.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request significantly enhances tool execution and display capabilities. Backend changes include more robust parsing of tool calls (JSON and XML-style) from model outputs in llama_cpp.py, the introduction of tool_start and tool_end events for detailed tool execution tracking, and the implementation of a persistent working directory (~/studio_sandbox/) for Python and Bash tool executions in tools.py to maintain state across calls. Frontend improvements involve a new ToolOutputsGroup component to display detailed tool inputs and outputs in a collapsible UI, an elapsed time display for tool execution, and an HtmlPreview component for rendering HTML code fences within an iframe, complete with resizing and fullscreen options. Additionally, a copy button for reasoning text was added. A security concern was raised regarding the HtmlPreview component, which renders LLM-generated HTML in an iframe with sandbox="allow-scripts", potentially introducing cross-site scripting vulnerabilities; a more restrictive sandbox or HTML sanitization is recommended.

<iframe
ref={iframeRef}
srcDoc={srcDoc}
sandbox="allow-scripts"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-high high

The HtmlPreview component renders HTML generated by the language model inside an iframe with sandbox="allow-scripts". While the sandbox provides some protection, allowing scripts from an untrusted source (the LLM) to run can introduce security vulnerabilities. For example, a malicious script could attempt to make requests to localhost or other internal services, or perform other undesirable actions.

For enhanced security, consider using a more restrictive sandbox or sanitizing the HTML before rendering. If the ResizeObserver script is the only reason for allow-scripts, you could explore alternative ways to size the iframe that don't require executing arbitrary scripts from the LLM output. One option is to use a library like dompurify to sanitize the HTML, allowing only safe tags and attributes.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5bef5953f4

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +191 to +195
<iframe
ref={iframeRef}
srcDoc={srcDoc}
sandbox="allow-scripts"
style={{ width: "100%", height, border: "none", display: "block" }}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Prevent automatic script execution in HTML previews

HtmlPreview feeds untrusted assistant-generated HTML directly into srcDoc and enables sandbox="allow-scripts", so any <script> inside an html code fence runs automatically as soon as the message renders. Since model output can include prompt-injected or adversarial HTML, this allows arbitrary browser-side code execution (e.g., CPU-burning loops or unsolicited network beacons) without an explicit user opt-in. This should be gated behind a user action or rendered without script permissions.

Useful? React with 👍 / 👎.

>
<div className="mt-2 space-y-2">
{toolOutputs.map((tool, idx) => (
<ToolItemCard key={tool.toolCallId || idx} tool={tool} />
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Use unique keys for tool output list items

The list key uses tool.toolCallId directly, but fallback tool-call parsing emits IDs like call_0 per parse cycle, so multi-step tool iterations can produce duplicate IDs within one message. Duplicate React keys can cause row state/content to be reused incorrectly (for example, collapsible state sticking to the wrong tool output) and may drop or merge entries unpredictably. Use a guaranteed-unique key per rendered entry (e.g., include idx or another monotonic suffix).

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 24c1e4a56c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +1833 to +1835
for pat in _TOOL_CALL_PATTERNS:
text = pat.sub("", text)
return text.strip()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Keep streamed text monotonic after markup filtering

The streaming pipeline assumes each emitted content value is a monotonic prefix (new_text = cumulative[len(prev_text):] in studio/backend/routes/inference.py), but _strip_tool_markup can delete already-emitted characters once a partial tag becomes a full regex match and then trims again with .strip(). When a model streams tool markup across multiple chunks, this causes unrecoverable corruption (e.g., leaked partial tag prefixes or words glued together) because the proxy cannot retract prior bytes. Make the filtered cumulative text monotonic (or defer filtering until full blocks are known) before emitting it.

Useful? React with 👍 / 👎.

Comment on lines +102 to +104
? tool.input?.query as string ?? ""
: tool.toolName === "python"
? (tool.input?.code as string ?? "").split("\n")[0].slice(0, 60)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Validate tool input types before preview truncation

The preview subtitle assumes tool.input.query/code/command are strings and immediately uses string operations (.split/.slice) after TS-only casts. Tool arguments come from model output and are not guaranteed to be strings; if a malformed tool call sends an object/number here, rendering throws (or React receives a non-renderable object), which can break the entire assistant message UI. Guard these fields with typeof value === "string" and fallback text before truncation.

Useful? React with 👍 / 👎.

danielhanchen and others added 2 commits March 18, 2026 12:51
Add three user-configurable tool call settings to the Studio Settings panel:

- Auto Heal Tool Calls: toggle to control fallback XML parsing of malformed
  tool calls from model output (default: on)
- Max Tool Calls Per Message: slider 0-40 + Max to cap tool call iterations
  per message (default: 10)
- Max Tool Call Duration: slider 1-30 minutes + Max to set per-tool-call
  execution timeout (default: 5 minutes)

All settings persist to localStorage and flow through the full stack:
frontend store -> API request -> Pydantic model -> route -> llama_cpp -> tools.
@danielhanchen
Copy link
Copy Markdown
Member Author

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces significant improvements to tool call handling and execution, including a refactored tool call parsing logic in the backend to better handle malformed model outputs, new parameters for auto_heal_tool_calls, max_tool_iterations, and tool_call_timeout, and the implementation of a persistent working directory for tool execution. The frontend now includes a new HtmlPreview component for rendering HTML code fences in an iframe, a copy button for reasoning text, and a ToolOutputsGroup component to display detailed tool execution inputs and outputs. However, the HtmlPreview component has a high-severity Cross-Site Scripting (XSS) vulnerability due to direct rendering of unsanitized HTML from the model, and a medium-severity information leakage vulnerability in its resizeScript due to insecure postMessage usage with a wildcard origin.

Comment on lines +115 to +200
function HtmlPreview({ source }: { source: string }) {
const iframeRef = useRef<HTMLIFrameElement>(null);
const [height, setHeight] = useState(HTML_PREVIEW_DEFAULT_HEIGHT);
const [enlarged, setEnlarged] = useState(false);

useEffect(() => {
const handler = (e: MessageEvent) => {
if (e.source !== iframeRef.current?.contentWindow) return;
if (typeof e.data?.htmlPreviewHeight === "number") {
setHeight(Math.min(Math.max(e.data.htmlPreviewHeight, 100), HTML_PREVIEW_MAX_HEIGHT));
}
};
window.addEventListener("message", handler);
return () => window.removeEventListener("message", handler);
}, []);

useEffect(() => {
if (!enlarged) return;
const handler = (e: KeyboardEvent) => {
if (e.key === "Escape") setEnlarged(false);
};
window.addEventListener("keydown", handler);
return () => window.removeEventListener("keydown", handler);
}, [enlarged]);

const resizeScript = `<script>new ResizeObserver(()=>{
parent.postMessage({htmlPreviewHeight:document.documentElement.scrollHeight},"*");
}).observe(document.documentElement);</script>`;

const srcDoc = source + resizeScript;

if (enlarged) {
return (
<>
<div className="mt-2 overflow-hidden rounded-lg border border-border" style={{ height }}>
{/* Placeholder keeps layout stable while overlay is shown */}
</div>
<div
className="fixed inset-0 z-50 flex flex-col bg-background/80 backdrop-blur-sm"
onClick={(e) => { if (e.target === e.currentTarget) setEnlarged(false); }}
>
<div className="flex items-center justify-end gap-2 px-4 py-2">
<button
type="button"
className="flex items-center gap-1.5 rounded-md border border-border bg-background px-3 py-1.5 text-sm text-muted-foreground transition-colors hover:bg-muted hover:text-foreground"
onClick={() => setEnlarged(false)}
title="Exit fullscreen (Esc)"
>
<Minimize2Icon className="size-4" />
Exit fullscreen
</button>
</div>
<div className="mx-4 mb-4 flex-1 overflow-hidden rounded-lg border border-border bg-background">
<iframe
ref={iframeRef}
srcDoc={srcDoc}
sandbox="allow-scripts"
style={{ width: "100%", height: "100%", border: "none", display: "block" }}
title="HTML preview"
/>
</div>
</div>
</>
);
}

return (
<div className="group/html-preview relative mt-2 overflow-hidden rounded-lg border border-border">
<button
type="button"
className="absolute top-2 right-2 z-10 rounded-md border border-border bg-background/80 p-1.5 text-muted-foreground opacity-0 transition-all hover:bg-muted hover:text-foreground group-hover/html-preview:opacity-100 supports-[backdrop-filter]:backdrop-blur"
onClick={() => setEnlarged(true)}
title="Enlarge preview"
>
<Maximize2Icon className="size-4" />
</button>
<iframe
ref={iframeRef}
srcDoc={srcDoc}
sandbox="allow-scripts"
style={{ width: "100%", height, border: "none", display: "block" }}
title="HTML preview"
/>
</div>
);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-high high

The HtmlPreview component renders HTML content from the model directly into an iframe with sandbox="allow-scripts". This creates a potential Cross-Site Scripting (XSS) vulnerability. While the sandbox attribute without allow-same-origin provides some protection, a malicious model could still craft HTML with scripts that perform undesirable actions (e.g., redirects, phishing). Similar to how SVG content is sanitized with sanitizeSvg, you should implement HTML sanitization on the source before rendering it in the iframe to mitigate this risk.

}, [enlarged]);

const resizeScript = `<script>new ResizeObserver(()=>{
parent.postMessage({htmlPreviewHeight:document.documentElement.scrollHeight},"*");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-medium medium

The resizeScript uses parent.postMessage(..., "*"), which is insecure as it allows any website to receive the posted message if this content is embedded. This could lead to information leakage. It's recommended to specify a specific target origin instead of the wildcard *. You can achieve this by passing window.origin from the parent component into the script string.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bfd5da1185

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread studio/backend/core/inference/tools.py Outdated
name: str, arguments: dict, cancel_event = None, timeout: int | None = None
) -> str:
"""Execute a tool by name with the given arguments. Returns result as a string."""
effective_timeout = timeout if timeout is not None else _EXEC_TIMEOUT
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve the no-timeout sentinel for long-running tools

When a caller selects the new “Max Tool Call Duration” option, generate_chat_completion_with_tools() passes None to mean “no limit”, but execute_tool() immediately rewrites None back to _EXEC_TIMEOUT. As a result, Python/terminal jobs that legitimately run longer than 5 minutes are still killed even though the request explicitly asked for an unlimited timeout.

Useful? React with 👍 / 👎.

Comment thread studio/backend/core/inference/tools.py Outdated
Comment on lines 104 to 105
if name == "web_search":
return _web_search(arguments.get("query", ""))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Apply the configured tool timeout to web searches too

The new tool_call_timeout setting is described and surfaced as a per-tool-call limit, but the web_search branch ignores effective_timeout entirely. If DDGS stalls or a user lowers the timeout to keep the agent responsive, web searches can still block the whole tool loop far past the configured limit while the other tools are capped.

Useful? React with 👍 / 👎.

danielhanchen and others added 3 commits March 18, 2026 13:11
- Use a sentinel to distinguish timeout=None (no limit) from the default
  (300s). Previously None was silently replaced with _EXEC_TIMEOUT.
- Pass the configured timeout to DDGS() for web searches so the setting
  applies uniformly to all tool types.
@danielhanchen
Copy link
Copy Markdown
Member Author

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces significant improvements to tool call parsing, tool execution management, and user interface feedback. The backend now features a more robust _parse_tool_calls_from_text function that correctly handles complex JSON arguments and various XML-style tool call formats, including optional closing tags. A persistent ~/studio_sandbox/ working directory ensures that files created by one tool call are accessible to subsequent calls, which is a crucial enhancement for multi-step tool execution. Frontend improvements include a new collapsible 'Tool Outputs' panel that displays detailed inputs and outputs for each tool call, an elapsed timer for tool status, and a copy button for reasoning blocks. Additionally, the markdown renderer now supports interactive HTML previews for code fences. New parameters for auto-healing tool calls, maximum tool iterations, and tool call timeouts have been integrated across the stack, providing greater control and flexibility. The changes are well-implemented, enhance both functionality and user experience, and demonstrate a thorough approach to improving the agentic capabilities of the system.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 82ad6666e2

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread studio/backend/core/inference/tools.py Outdated
Comment on lines +194 to +197
stdout = subprocess.PIPE,
stderr = subprocess.STDOUT,
text = True,
cwd = tempfile.gettempdir(),
cwd = _get_workdir(),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Place Python temp scripts in the sandbox workdir

The Python tool now launches with cwd = _get_workdir(), but the script is still created via tempfile.mkstemp(...) in the system temp directory. When Python runs a file, sys.path[0] points to the script directory (e.g. /tmp), not the working directory, so modules/files created in ~/studio_sandbox by earlier tool steps are no longer importable with import .... This breaks multi-step workflows that persist code between tool calls unless the temp script is written in the sandbox workdir (or PYTHONPATH is adjusted).

Useful? React with 👍 / 👎.

danielhanchen and others added 2 commits March 18, 2026 13:45
- Add ge=0 constraint to max_tool_calls_per_message (rejects negative values)
- Add ge=1 constraint to tool_call_timeout (minimum 1 second)
- Thread session_id from frontend through backend to tool execution
- Scope sandbox directories per conversation: ~/studio_sandbox/{thread_id}/
- Backwards compatible: API callers without session_id use ~/studio_sandbox/
@danielhanchen
Copy link
Copy Markdown
Member Author

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces significant enhancements to tool call parsing, UI feedback for tool execution, and persistent working directories for tools. The backend now features more robust parsing for both JSON and XML-style tool calls, effectively addressing issues with nested structures and optional closing tags. On the frontend, the addition of a dedicated 'Tool Outputs' panel, an elapsed time display for tool status, and copy buttons for reasoning blocks greatly improve the user experience and observability of tool interactions. Furthermore, the introduction of configurable settings for tool call healing, maximum iterations, and timeout, alongside session-specific working directories, adds valuable flexibility and control to the system.

Comment on lines +1185 to +1205
for m in re.finditer(r"<tool_call>\s*\{", content):
brace_start = m.end() - 1 # position of the opening {
depth, i = 0, brace_start
in_string = False
while i < len(content):
ch = content[i]
if in_string:
if ch == "\\" and i + 1 < len(content):
i += 2 # skip escaped character
continue
if ch == '"':
in_string = False
elif ch == '"':
in_string = True
elif ch == "{":
depth += 1
elif ch == "}":
depth -= 1
if depth == 0:
break
i += 1
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The manual character-by-character iteration for balanced-brace JSON extraction is robust but makes the _parse_tool_calls_from_text function quite long and complex. Consider extracting this logic into a dedicated helper function to improve readability and maintainability. This would make the main function's flow easier to follow.

Comment on lines +1240 to +1250
next_func = (
func_starts[idx + 1].start()
if idx + 1 < len(func_starts)
else len(content)
)
end_tag = re.search(r"</tool_call>", content[body_start:])
if end_tag:
body_end = body_start + end_tag.start()
else:
body_end = len(content)
body_end = min(body_end, next_func)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The logic for determining the body_end in the XML parsing (next_func, end_tag, min) is quite intricate. While it appears correct, it could benefit from additional comments explaining the priority or flow of these boundary conditions, or perhaps a small helper function to encapsulate this boundary calculation.


# Per-session working directories so each chat thread gets its own sandbox.
# Falls back to a shared ~/studio_sandbox/ for API callers without a session_id.
_workdirs: dict[str, str] = {}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The _workdirs global dictionary is used to manage session-specific working directories. While session_id provides isolation, global mutable state can introduce complexities in managing the lifecycle of these directories. Consider adding a mechanism to clean up old entries from _workdirs when sessions are no longer active to prevent potential memory growth or stale directories.

<iframe
ref={iframeRef}
srcDoc={srcDoc}
sandbox="allow-scripts"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The sandbox="allow-scripts" attribute is a good security measure for HTML previews. However, allow-scripts is a relatively permissive flag. If the HTML content is not fully trusted and does not strictly require JavaScript execution for its intended display, consider using a more restrictive set of sandbox flags (e.g., allow-same-origin if needed for styling, but without allow-scripts) to further reduce the attack surface.

const MAX_DISPLAY_LENGTH = 10000;
const COPY_RESET_MS = 2000;

function truncate(text: string): string {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The MAX_DISPLAY_LENGTH for truncating tool output is set to 10,000 characters. While this prevents UI breakage from excessively long outputs, it might still be too long for a default display, especially on smaller screens. Consider a shorter default truncation (e.g., 500-1000 characters) with an explicit 'show more' or 'expand' option to improve readability and user experience.

input: toolEvent.arguments as Record<string, unknown>,
});
} else if (toolEvent.type === "tool_end") {
const key = (toolEvent.tool_call_id as string) || [...pendingTools.keys()].pop() || "";
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The fallback logic [...pendingTools.keys()].pop() || "" for tool_end events to find a tool_call_id is a bit indirect. If tool_call_id is expected to be consistently present in tool_end events when it was present in tool_start, this fallback might indicate an unexpected data flow. If tool_call_id can genuinely be missing for tool_end but a match is still desired, consider adding a comment explaining this specific edge case and why pop() is the chosen fallback.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 278d9482b2

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +575 to +579
toolOutputs.push({
toolName: pending.toolName ?? (toolEvent.tool_name as string),
toolCallId: key,
input: pending.input ?? {},
result: toolEvent.result as string,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Avoid streaming raw tool stdout/stderr into message metadata

toolOutputs.push(...) now copies every tool_end payload into assistant-message metadata, which the new UI renders and persists. Because the Python and terminal subprocesses are started without an env= override in studio/backend/core/inference/tools.py (see the subprocess.Popen(...) calls there), anything they print from host env vars or local files is now sent straight to the browser and saved client-side instead of staying inside the model/tool loop. In deployments where code tools can read sensitive host data, this is a real data-exposure regression even when the assistant's final answer would not have surfaced that output.

Useful? React with 👍 / 👎.

Comment on lines +39 to +43
if session_id:
workdir = os.path.join(home, "studio_sandbox", session_id)
else:
workdir = os.path.join(home, "studio_sandbox")
os.makedirs(workdir, exist_ok = True)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Clean up per-session sandbox directories after thread teardown

_get_workdir() now creates a persistent ~/studio_sandbox/<session_id> directory for every thread and caches it, but a repo-wide search for studio_sandbox / _workdirs only finds this creation path—there is no deletion when a chat is removed or expires. Since the Python and terminal tools can write arbitrary files there, long-lived Studio deployments will accumulate one directory per thread plus all generated artifacts until disk space or inodes are exhausted.

Useful? React with 👍 / 👎.

danielhanchen and others added 2 commits March 18, 2026 14:10
- Split tool markup stripping into closed-only (mid-stream) and full
  (final flush) to prevent cumulative text from shrinking mid-stream
- Enforce monotonicity: only emit when cleaned text grows, so the
  proxy's delta logic (cumulative[len(prev_text):]) never breaks
- Place Python temp scripts in the sandbox workdir instead of /tmp so
  sys.path[0] points to the sandbox and cross-call imports work
@danielhanchen
Copy link
Copy Markdown
Member Author

/gemini review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ad4aa5c564

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread studio/backend/core/inference/tools.py Outdated
Comment on lines +39 to +40
if session_id:
workdir = os.path.join(home, "studio_sandbox", session_id)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Sanitize session_id before using it as a filesystem path

session_id is now a request field, but _get_workdir() feeds it straight into os.path.join(home, "studio_sandbox", session_id). An API caller can pass values like "../../outside" or an absolute path such as "/tmp/foo", and both the Python and terminal tools will then create/run files in that resolved directory instead of staying under ~/studio_sandbox. That turns the new per-session sandboxing into a path-traversal escape for any deployment that exposes tool calls to untrusted clients.

Useful? React with 👍 / 👎.

Comment on lines +1056 to +1057
max_tool_iterations = payload.max_tool_calls_per_message
if payload.max_tool_calls_per_message is not None
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Enforce the new tool-call limit on actual calls

This request setting is surfaced as max_tool_calls_per_message, but in the GGUF path it is forwarded directly to max_tool_iterations. In generate_chat_completion_with_tools(), each iteration still executes every entry in tool_calls, so a model response containing multiple tool calls can exceed the configured cap immediately (for example, setting the slider to 1 still runs all calls from the first assistant turn). Users relying on this new control to bound latency or tool usage will not get the limit they asked for.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

The pull request introduces several enhancements to the tool execution functionality. It includes changes to the pyproject.toml file to update the homepage and documentation URLs. The llama_cpp.py file was modified to improve the parsing of tool calls from text, handling cases where closing tags are omitted. The generate_chat_completion_with_tools function in llama_cpp.py was updated to include auto-healing of tool calls, increase the maximum tool iterations, and add a timeout for tool calls. The tools.py file was modified to include per-session working directories for tool execution, add a timeout parameter to the execute_tool function, and update the timeout message to include the actual timeout value. The ChatCompletionRequest model in models/inference.py was updated to include fields for auto-healing tool calls, maximum tool calls per message, tool call timeout, and session ID. The gguf_generate_with_tools function in routes/inference.py was updated to include parameters for auto-healing tool calls, maximum tool iterations, tool call timeout, and session ID. The markdown-text.tsx file was modified to add support for HTML previews. The reasoning.tsx file was modified to add a copy button to the reasoning text. The thread.tsx file was modified to add a tool outputs group. The tool-outputs.tsx file was created to display the tool outputs. The chat-adapter.ts file was modified to include a ToolOutput interface and to accumulate tool start/end events for the tool outputs panel. The chat-api.ts file was modified to yield tool start/end events. The chat-settings-sheet.tsx file was modified to add auto-heal tool calls, max tool calls slider, and tool call timeout slider. The chat-runtime-store.ts file was modified to add auto-heal tool calls, max tool calls per message, and tool call timeout. The api.ts file was modified to include parameters for auto-healing tool calls, maximum tool calls per message, tool call timeout, and session ID. The regular expression in llama_cpp.py for stripping tool call markup is too greedy and may lead to incorrect stripping of content. The docstring for the execute_tool function in tools.py should be updated to reflect the new timeout and session_id parameters. The TemporaryDirectory context manager is no longer used in tools.py, so the try...finally block for cleaning up tmp_path is unnecessary. The temporary directory is no longer used in tools.py, so this entire try block can be simplified. The timeout message in tools.py should include the actual timeout value. The block of if statements in routes/inference.py can be simplified by using the or operator to provide default values.

I am having trouble creating individual review comments. Click here to see my feedback.

studio/backend/core/inference/llama_cpp.py (1688-1693)

medium

The regular expression r"<tool_call>.*$" will greedily match everything from the first <tool_call> tag to the end of the content, even if there are multiple <tool_call> tags. This could lead to incorrect stripping of content if the model generates multiple tool calls but the first one is unterminated. Consider using a non-greedy match or a more specific pattern to avoid this.

                        r"<tool_call>.*?$",
                        "",
                        content_text,
                        flags = re.DOTALL,

studio/backend/core/inference/tools.py (83-84)

medium

The docstring should be updated to reflect the new timeout and session_id parameters.

def execute_tool(name: str, arguments: dict, cancel_event = None,
                 timeout: int | None = _TIMEOUT_UNSET,
                 session_id: str | None = None) -> str:
    """Execute a tool by name with the given arguments. Returns result as a string.

    ``timeout``: int sets per-call limit in seconds, ``None`` means no limit,
    unset (default) uses ``_EXEC_TIMEOUT`` (300 s).
    ``session_id``: optional thread/session ID for per-conversation sandbox isolation.
    """

studio/backend/core/inference/tools.py (215-249)

medium

The TemporaryDirectory context manager is no longer used, so the try...finally block for cleaning up tmp_path is unnecessary. The temporary file is now created in a persistent working directory, so it should not be deleted after each execution.

    tmp_path = None
    workdir = _get_workdir(session_id)
    try:
        fd, tmp_path = tempfile.mkstemp(
            suffix = ".py", prefix = "studio_exec_", dir = workdir
        )
        with os.fdopen(fd, "w") as f:
            f.write(code)

        proc = subprocess.Popen(
            [sys.executable, tmp_path],
            stdout = subprocess.PIPE,
            stderr = subprocess.STDOUT,
            text = True,
            cwd = workdir,
        )

        # Spawn cancel watcher if we have a cancel event
        if cancel_event is not None:
            watcher = threading.Thread(
                target = _cancel_watcher, args = (proc, cancel_event), daemon = True
            )
            watcher.start()

        try:
            output, _ = proc.communicate(timeout = timeout)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.communicate()
            return _truncate(f"Execution timed out after {timeout} seconds.")

        if cancel_event is not None and cancel_event.is_set():
            return "Execution cancelled."

        result = output or ""
        if proc.returncode != 0:
            result = f"Exit code {proc.returncode}:\n{result}"
        return _truncate(result) if result.strip() else "(no output)"

    except Exception as e:
        return f"Execution error: {e}"

studio/backend/core/inference/tools.py (217-224)

medium

The temporary directory is no longer used, so this entire try block can be simplified.

    try:
        proc = subprocess.Popen(
            ["bash", "-c", command],
            stdout = subprocess.PIPE,
            stderr = subprocess.STDOUT,
            text = True,
            cwd = tmpdir,
        )

studio/backend/core/inference/tools.py (234-238)

medium

The timeout message should include the actual timeout value.

                proc.kill()
                proc.communicate()
                return _truncate(f"Execution timed out after {timeout} seconds.")

studio/backend/core/inference/tools.py (238)

medium

The timeout message should include the actual timeout value.

            proc.kill()
            proc.communicate()
            return _truncate(f"Execution timed out after {timeout} seconds.")

studio/backend/core/inference/tools.py (291-295)

medium

The timeout message should include the actual timeout value.

            proc.kill()
            proc.communicate()
            return _truncate(f"Execution timed out after {timeout} seconds.")

studio/backend/routes/inference.py (1053-1062)

medium

This block of if statements can be simplified by using the or operator to provide default values.

                    enable_thinking = payload.enable_thinking,
                    auto_heal_tool_calls = payload.auto_heal_tool_calls if payload.auto_heal_tool_calls is not None else True,
                    max_tool_iterations = payload.max_tool_calls_per_message if payload.max_tool_calls_per_message is not None else 10,
                    tool_call_timeout = payload.tool_call_timeout if payload.tool_call_timeout is not None else 300,
                    session_id = payload.session_id,

Strip path separators and parent-dir references from session_id before
using it as a directory name. Verify the resolved path stays under
~/studio_sandbox/ as a second guard.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 00de9c306e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

tool_call_timeout = payload.tool_call_timeout
if payload.tool_call_timeout is not None
else 300,
session_id = payload.session_id,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Bind sandbox session IDs to server-owned conversation IDs

session_id from the request body is forwarded directly into tool execution (generate_chat_completion_with_tools(..., session_id=payload.session_id)), and the tool layer uses that value to select the persistent sandbox directory. Because there is no server-side ownership check tying this ID to the authenticated caller/thread, a client can reuse another conversation’s ID and read or overwrite its persisted tool artifacts, which breaks isolation in multi-user deployments.

Useful? React with 👍 / 👎.

Comment on lines 1890 to +1894
if in_thinking:
if has_content_tokens:
cumulative += "</think>"
yield {"type": "content", "text": cumulative}
yield {
"type": "content",
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Run final tool-markup cleanup for non-thinking streams

The end-of-stream path only applies _strip_tool_markup(..., final=True) inside if in_thinking; when in_thinking is false, the function returns without a final cleanup pass. Since incremental stripping only removes closed blocks, any unterminated <tool_call> or <function=...> markup that appears at the end of a normal response can leak into user-visible output.

Useful? React with 👍 / 👎.

Replace custom metadata-based ToolOutputsGroup with native assistant-ui
tool-call content parts. Backend SSE tool_start/tool_end events now emit
proper { type: "tool-call" } parts from the adapter, enabling per-tool
UIs registered via tools.by_name in MessagePrimitive.Parts.

- Web search: Globe icon, Source badges with favicons, auto-collapse
  when LLM starts responding
- Python: Code icon, syntax-highlighted code via Streamdown/shiki,
  output block with copy
- Terminal: Terminal icon, command in trigger, output with copy
- ToolGroup wraps consecutive tool calls (skips for single calls)
- Sources component renders URL badges at end of message
- Flattened code block CSS (single border, no nested boxes)
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

if payload.enabled_tools:
tools_to_use = [
t
for t in ALL_TOOLS
if t["function"]["name"] in payload.enabled_tools
]
else:
tools_to_use = ALL_TOOLS

P1 Badge Respect an explicit empty enabled_tools allowlist

When enabled_tools is [], this truthiness check falls back to ALL_TOOLS, so a request that intentionally disables every tool still exposes the full tool set, including Python and terminal execution. The new per-tool allowlist needs to distinguish null/omitted from an explicit empty list; otherwise a client-side config bug or policy rollout can silently re-enable code-execution tools.

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +45 to +47
<img
data-slot="source-icon"
src={`https://www.google.com/s2/favicons?domain=${domain}&sz=32`}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid leaking cited domains to Google's favicon service

Every source pill now triggers a browser request to www.google.com with the cited domain as soon as the message renders, before the user clicks anything. In self-hosted or privacy-sensitive deployments this leaks the generated source list and the viewer's IP to a third party, and it will also break under offline/CSP-restricted setups. A local fallback icon (or server-side proxy) would avoid introducing that new outbound dependency.

Useful? React with 👍 / 👎.

wasimysaid and others added 2 commits March 18, 2026 16:25
`if payload.enabled_tools:` is falsy for [], falling through to
ALL_TOOLS. Use `is not None` so an explicit empty list disables
all tools as intended.
@danielhanchen danielhanchen merged commit 9c95148 into main Mar 18, 2026
3 of 4 checks passed
@danielhanchen danielhanchen deleted the studio-tool-elapsed-timer branch March 18, 2026 15:28
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f585caa49b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +592 to +595
const id = (toolEvent.tool_call_id as string) ||
toolCallParts[toolCallParts.length - 1]?.toolCallId || "";
const idx = toolCallParts.findIndex((p) => p.toolCallId === id);
if (idx !== -1) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Match tool_end to the latest duplicate tool call ID

When tool IDs are reused across iterations (the fallback parser in llama_cpp.py generates call_0, call_1, etc. per parse cycle), this lookup updates the first matching entry instead of the currently running one. In multi-step tool runs, a later tool_end can overwrite an earlier call’s result and leave the new call stuck in running state. This breaks tool-output attribution whenever duplicate tool_call_id values appear; the matcher should target the most recent unresolved call (e.g., last matching index) rather than the first match.

Useful? React with 👍 / 👎.

shibizhao pushed a commit to shibizhao/unsloth-npu that referenced this pull request Apr 7, 2026
…slothai#4416)

* Add elapsed timer to tool status pill in Studio

Show a count-up seconds timer (0s, 1s, 2s, ...) next to the tool status
text in the composer area. Helps users gauge how long a tool call (web
search, code execution) has been running. Timer resets when a new tool
starts and disappears when all tools finish.

* Fix tool call parsing, add tool outputs panel and reasoning copy button

Backend:
- Rewrite tool call XML parser to use balanced-brace JSON extraction
  instead of greedy regex, fixing truncation on nested braces in
  code/JSON arguments
- Handle optional closing tags (</tool_call>, </function>, </parameter>)
  that models frequently omit
- Support bare <function=...> tags without <tool_call> wrapper
- Strip tool call markup from streamed content so raw XML never leaks
  into the chat UI
- Use a persistent ~/studio_sandbox/ working directory for tool
  execution so files persist across calls within a session
- Emit tool_start/tool_end SSE events so the frontend can display
  tool inputs and outputs

Frontend:
- Add collapsible "Tool Outputs" panel below assistant messages showing
  each tool call's input and output with copy buttons
- Add copy button to reasoning blocks
- Add elapsed timer to tool status pill
- Update project URLs in pyproject.toml (http -> https, add docs link)

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add interactive HTML preview with fullscreen toggle for code blocks

HTML code fences now render an interactive sandboxed iframe preview
below the syntax-highlighted code, similar to how SVG fences show
an image preview. The iframe uses sandbox="allow-scripts" to allow
JavaScript execution while blocking access to the parent page.

Includes a fullscreen toggle (enlarge/minimize button) that expands
the preview into a viewport overlay, dismissible via button, Escape
key, or backdrop click. A streaming placeholder prevents partial
HTML from rendering mid-stream.

* Add tool call settings: auto-heal toggle, max iterations, timeout

Add three user-configurable tool call settings to the Studio Settings panel:

- Auto Heal Tool Calls: toggle to control fallback XML parsing of malformed
  tool calls from model output (default: on)
- Max Tool Calls Per Message: slider 0-40 + Max to cap tool call iterations
  per message (default: 10)
- Max Tool Call Duration: slider 1-30 minutes + Max to set per-tool-call
  execution timeout (default: 5 minutes)

All settings persist to localStorage and flow through the full stack:
frontend store -> API request -> Pydantic model -> route -> llama_cpp -> tools.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix tool call timeout: respect no-limit and apply to web search

- Use a sentinel to distinguish timeout=None (no limit) from the default
  (300s). Previously None was silently replaced with _EXEC_TIMEOUT.
- Pass the configured timeout to DDGS() for web searches so the setting
  applies uniformly to all tool types.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add input validation bounds and per-thread sandbox isolation

- Add ge=0 constraint to max_tool_calls_per_message (rejects negative values)
- Add ge=1 constraint to tool_call_timeout (minimum 1 second)
- Thread session_id from frontend through backend to tool execution
- Scope sandbox directories per conversation: ~/studio_sandbox/{thread_id}/
- Backwards compatible: API callers without session_id use ~/studio_sandbox/

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix non-monotonic streaming and Python temp script path

- Split tool markup stripping into closed-only (mid-stream) and full
  (final flush) to prevent cumulative text from shrinking mid-stream
- Enforce monotonicity: only emit when cleaned text grows, so the
  proxy's delta logic (cumulative[len(prev_text):]) never breaks
- Place Python temp scripts in the sandbox workdir instead of /tmp so
  sys.path[0] points to the sandbox and cross-call imports work

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Sanitize session_id to prevent path traversal in sandbox

Strip path separators and parent-dir references from session_id before
using it as a directory name. Verify the resolved path stays under
~/studio_sandbox/ as a second guard.

* feat(chat): proper assistant-ui tool call UIs with sources

Replace custom metadata-based ToolOutputsGroup with native assistant-ui
tool-call content parts. Backend SSE tool_start/tool_end events now emit
proper { type: "tool-call" } parts from the adapter, enabling per-tool
UIs registered via tools.by_name in MessagePrimitive.Parts.

- Web search: Globe icon, Source badges with favicons, auto-collapse
  when LLM starts responding
- Python: Code icon, syntax-highlighted code via Streamdown/shiki,
  output block with copy
- Terminal: Terminal icon, command in trigger, output with copy
- ToolGroup wraps consecutive tool calls (skips for single calls)
- Sources component renders URL badges at end of message
- Flattened code block CSS (single border, no nested boxes)

* fix(inference): respect empty enabled_tools allowlist

`if payload.enabled_tools:` is falsy for [], falling through to
ALL_TOOLS. Use `is not None` so an explicit empty list disables
all tools as intended.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Shine1i <wasimysdev@gmail.com>
danielhanchen added a commit to danielhanchen/unsloth-staging-2 that referenced this pull request May 17, 2026
Why the existing sandbox-hardening code is edited (FA3 rationale):

- `_sandbox_preexec` (tools.py:241) and its `_libc.prctl(PR_SET_NO_NEW_PRIVS)`
  block (tools.py:257-260) were introduced by unslothai#4827 / unslothai#5375 to harden tool
  subprocesses. That contract is preserved: the original function is renamed
  to `_sandbox_preexec_impl(apply_no_new_privs)` and a same-named
  `_sandbox_preexec()` wrapper calls it with `apply_no_new_privs=True`, so
  every direct caller still sets NNP. The new
  `_sandbox_preexec_for_bwrap()` skips NNP only on the Linux+bwrap path
  where bwrap itself reapplies NNP to the inner payload (setuid-helper
  bwrap installs cannot acquire helper privileges if NNP is set before
  execve). Net result: the inner sandboxed program still runs with NNP=1.

- The `if sys.platform != "win32": popen_kwargs["preexec_fn"] = ...` /
  `else: popen_kwargs["creationflags"] = subprocess.CREATE_NO_WINDOW`
  blocks in `_python_exec` (tools.py:1615-1618) and `_bash_exec`
  (tools.py:1709-1712) come from unslothai#4827. They are reorganised, not
  removed: the non-Windows path still sets `preexec_fn` and the Windows
  path still sets `CREATE_NO_WINDOW`. The only behavioural delta is the
  bwrap branch (Linux + sandbox_available()) selecting
  `_sandbox_preexec_for_bwrap` per above.

- `_workdirs[key] = workdir` (tools.py:348) from unslothai#4416 is replaced with
  `_workdirs[key] = os.path.realpath(workdir)`. Same caching behaviour;
  the realpath normalization is required because the bwrap argv binds
  `os.path.realpath(workdir)` while `tempfile.mkstemp(dir=workdir)`
  builds tmp_path from the non-realpath value, so on hosts where HOME
  or a parent is a symlink the script path passed via inner_argv is
  not reachable inside the sandbox namespace.

studio/backend/core/inference/sandbox.py
- Narrow Linux /etc bind from the whole directory to a runtime allowlist
  (hosts, resolv.conf, nsswitch.conf, localtime, ld.so.cache, ld.so.conf
  and ld.so.conf.d, ssl, ca-certificates, pki) so sandboxed tools cannot
  read sshd_config, machine-id, cloud-init data, etc.
- Narrow Seatbelt /private/etc the same way; replace the broad subpath
  with literal entries for hosts, resolv.conf, nsswitch.conf, localtime,
  protocols, services and subpaths for ssl and ca-certificates.
- Constrain Seatbelt allow process-exec to the same subpaths already
  used in file-map-executable so sandboxed Python cannot exec arbitrary
  system binaries such as osascript.
- Remove allow file-read and file-write on literal /dev/tty from the
  Seatbelt profile; tool output is captured via PIPE and there is no
  legitimate path that needs to bypass it.
- Gate site.getusersitepackages() behind
  UNSLOTH_STUDIO_SANDBOX_ALLOW_USER_SITE=1 so the sandbox no longer
  binds the real ~/.local/lib subtree by default (HOME is already
  rewritten to workdir for the child).
- Resolve _BWRAP_PROBE_BIN via shutil.which at module load with the
  /usr/bin/true fallback, so the probe works on Alpine and BusyBox
  layouts that place coreutils under /bin instead of /usr/bin.
- Replace the unreachable _linux_bwrap_path-or-bwrap fallback with an
  assertion that names the sandbox_available contract; same for the
  unreachable RuntimeError in build_sandbox_argv.

studio/backend/run.py
- Run sandbox_available on a background thread with a 2s bounded wait.
  On hardened kernels (unprivileged_userns_clone=0) the probe hangs
  for its full 5s timeout; running it inline delayed the access banner.
  The probe still runs (and caches) eagerly; if it has not reported
  back within 2s the banner is printed first and the unavailable notice
  is suppressed for this call.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants