studio: improve GGUF tool calling accuracy and reliability by danielhanchen · Pull Request #4700 · unslothai/unsloth

danielhanchen · 2026-03-30T13:39:49Z

Summary

Improves tool calling for GGUF models in Unsloth Studio, particularly for smaller models (4B) that struggle with multi-step agentic workflows. The core changes add URL fetching support to web_search, inject better behavioral guidance into the system prompt, and clean up XML artifacts that leak into responses.

Changes

URL fetching in web_search (tools.py)

Added a url parameter to the web_search tool so models can fetch full page content from URLs found in search results, instead of being limited to short snippets
Uses html2text for clean HTML-to-markdown conversion (headings, links, lists preserved), with regex-based fallback if html2text is not installed
Increased _MAX_PAGE_CHARS from 4,000 to 16,000 so fetched pages contain enough context for the model to extract structured data
Search results now include an explicit hint that snippets are truncated and full content is available via the url parameter

System prompt nudge (inference.py)

Injects the current date so models can judge recency of search results
Adds explicit guidance: fetch URLs from search results for full content, do not repeat the same query, use code execution for data processing
Adapts the nudge based on which tools are enabled (web only, code only, or both)

Error recovery nudge (llama_cpp.py)

When a tool result starts with an error prefix (e.g. "Error", "Search failed", "Blocked"), appends a recovery hint telling the model to try a different approach
Prevents small models from looping on the same broken tool call

XML cleanup (inference.py, llama_cpp.py)

Strips <tool_call>...</tool_call> XML from assistant messages in conversation history before sending to the model
Strips the same XML from the outgoing SSE stream so users never see raw tool-call markup

Max tool iterations (inference.py, llama_cpp.py, models/inference.py, frontend store)

Raised default from 10 to 25 across backend defaults, Pydantic model schema, and frontend TypeScript store

Test Results

Tested with unsloth/Qwen3.5-4B-GGUF (UD-Q4_K_XL), web search + code execution + thinking enabled. Prompt: "List and categorize all the songs that charted #3 on the Billboard Hot 100 in 2015." 10 runs per configuration.

Metric	Before	After
XML leaks in response	10/10	0/10
URL fetches used	0	4/10 runs
Runs with correct song names	0/10	2/10
Avg tool calls	5.5	3.8
Avg response time	12.3s	9.8s

When the model does use URL fetching, it works well -- the best run correctly identified all 4 songs that peaked at #3 (Love Me like You Do, Earned It, Watch Me, Drag Me Down) by fetching and parsing the full Wikipedia table.

The remaining accuracy gap is a fundamental small-model limitation: the 4B model often generates "let me fetch that page" as text output rather than actually emitting a tool call. Larger models (9B+) should see higher accuracy with the same infrastructure.

Files Changed

File	Changes
`studio/backend/core/inference/tools.py`	URL fetch support, html2text, _MAX_PAGE_CHARS 4k->16k, snippet hint
`studio/backend/routes/inference.py`	System prompt nudge (date + behavioral rules), XML stripping, max iterations 10->25
`studio/backend/core/inference/llama_cpp.py`	Error recovery nudge, XML stripping in agentic loop, max iterations 10->25
`studio/backend/models/inference.py`	Default max_tool_calls_per_message 10->25
`studio/frontend/.../chat-runtime-store.ts`	Frontend default max tool calls 10->25

- Add URL fetching to web_search tool so models can read full page content instead of only getting search snippets. Uses html2text for clean markdown conversion with regex fallback. - Inject current date and behavioral guidance (URL fetch workflow, no repeated queries, use code for data processing) into the tool-use system prompt. - Append error recovery nudge to tool results that indicate failure, helping small models avoid looping on the same broken call. - Strip leaked <tool_call> XML from assistant messages in conversation history and from the outgoing SSE stream. - Raise default max tool iterations from 10 to 25 across backend, model schema, and frontend defaults. - Increase _MAX_PAGE_CHARS from 4k to 16k so fetched pages contain enough content for the model to extract useful information. - Add "IMPORTANT: These are only short snippets" hint to search results so models know to fetch full pages when needed. Tested with Qwen3.5-4B-GGUF (UD-Q4_K_XL), 10 runs before/after: - XML leaks in responses: 10/10 -> 0/10 - URL fetch usage: 0 -> 4/10 runs - Runs producing actual correct answers: 0/10 -> 2/10 - Average tool calls per query: 5.5 -> 3.8 (more efficient) - Average response time: 12.3s -> 9.8s

for more information, see https://pre-commit.ci

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 68546d7aaf

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-03-30T13:42:25Z

+    parsed = urlparse(url)
+    if parsed.scheme not in ("http", "https"):
+        return f"Blocked: only http/https URLs are allowed (got {parsed.scheme!r})."


Block internal hosts in URL fetcher

_fetch_page_text only validates the URL scheme, so web_search can fetch http(s) targets on loopback, link-local, or private networks (for example 127.0.0.1, localhost, 169.254.169.254, RFC1918 ranges). Because tool arguments come from model output, this introduces an SSRF path that can exfiltrate internal metadata/services in production deployments; host/IP allowlisting or private-range blocking is needed in addition to scheme checks.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-03-30T13:42:25Z

+    """
+    # Direct URL fetch mode
+    if url and url.strip():
+        return _fetch_page_text(url.strip(), timeout = min(timeout, 60))


Handle unlimited timeout in URL mode

When users set tool_call_timeout=9999 (documented as no limit), the caller passes timeout=None; this branch then executes min(timeout, 60), which raises TypeError for URL fetches before any network call. That means web_search with a url argument crashes the tool path under the documented unlimited-timeout setting.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-03-30T13:42:25Z

+        with urllib.request.urlopen(req, timeout = timeout) as resp:
+            raw_html = resp.read().decode("utf-8", errors = "replace")


Enforce response size before reading fetched pages

The function truncates text to _MAX_PAGE_CHARS only after resp.read() has already loaded the entire body into memory. Large or binary URLs can therefore consume excessive memory/CPU and destabilize inference workers despite the apparent page-size cap; read limits should be applied during download, not post-hoc.

Useful? React with 👍 / 👎.

gemini-code-assist

Code Review

This pull request enhances the tool-use framework by adding a direct URL fetching capability to the web_search tool, increasing the default maximum tool iterations from 10 to 25, and implementing system prompt nudges to improve model performance. It also includes logic to sanitize conversation history and output streams by stripping leaked tool-call XML tags. Feedback was provided regarding the consistency of timeout values in the fetching utility and the placement of imports according to PEP 8 guidelines.

gemini-code-assist · 2026-03-30T13:43:58Z

+_MAX_PAGE_CHARS = 16000  # limit fetched page text
+
+
+def _fetch_page_text(


The _fetch_page_text function has a default timeout of 30 seconds. However, when called from _web_search, the timeout is explicitly set to min(timeout, 60). This means the effective timeout for _fetch_page_text will always be capped at 60 seconds, and its own default of 30 seconds will only apply if it's called directly, not through _web_search.

To improve clarity and avoid potential confusion, consider making the default timeout for _fetch_page_text consistent with its intended maximum usage (e.g., 60 seconds if that's the hard limit for web fetches) or add a constant for this maximum timeout.

gemini-code-assist · 2026-03-30T13:43:58Z

+            _has_web = "web_search" in _tool_names
+            _has_code = "python" in _tool_names or "terminal" in _tool_names
+
+            from datetime import date as _date


The import statement from datetime import date as _date is placed inside the openai_chat_completions function. It is generally considered best practice to place all standard library imports at the top of the file, outside of any function definitions. This improves readability and ensures that dependencies are declared clearly at the module level.

References

PEP 8: Imports are always put at the top of the file, just after any module comments and docstrings, and before module globals and constants. ^(link)

Tested 16 configurations (4 models x 2 quants x 2 KV cache types) with 10 runs each on NVIDIA B200. Best config: 27B UD-Q4_K_XL + bf16 KV -- 6/10 runs found all 4 correct songs, 0 XML leaks, 131s average response time.

danielhanchen · 2026-03-30T21:39:36Z

Benchmark Results: Full Cartesian Grid

Tested 16 configurations across 4 model sizes, 2 quantizations, and 2 KV cache types. 10 runs per config, NVIDIA B200.

Prompt: "List and categorize all the songs that charted #3 on the Billboard Hot 100 in 2015."

Ground truth: 4 songs peaked at #3 -- "Love Me like You Do", "Earned It", "Watch Me", "Drag Me Down".

Model	Quant	KV Cache	OK/10	Avg Time	Tools	XML Leaks	URL Fetch	Peak3 Avg	All 4/4	Best Songs
4B	UD-Q4_K_XL	f16	10/10	9.8s	3.5	0/10	4/10	0.8/4	2/10	9
4B	UD-Q4_K_XL	bf16	10/10	10.6s	4.5	0/10	4/10	0.4/4	1/10	5
4B	Q8_0	f16	10/10	4.9s	2.4	0/10	8/10	0.4/4	1/10	5
4B	Q8_0	bf16	10/10	8.0s	3.0	0/10	5/10	0.0/4	0/10	0
9B	UD-Q4_K_XL	f16	10/10	6.7s	2.0	0/10	5/10	0.0/4	0/10	3
9B	UD-Q4_K_XL	bf16	9/10	49.5s	2.4	0/10	5/10	0.0/4	0/10	1
9B	Q8_0	f16	10/10	7.4s	2.5	0/10	5/10	0.0/4	0/10	2
9B	Q8_0	bf16	10/10	10.4s	2.7	0/10	6/10	1.0/4	2/10	15
27B	UD-Q4_K_XL	bf16	9/10	131.1s	13.8	0/10	7/10	2.7/4	6/10	27
27B	UD-Q4_K_XL	f16	7/10	201.6s	14.1	0/10	8/10	2.0/4	5/10	26
27B	Q8_0	f16	4/10	312.5s	16.0	1/10	10/10	2.4/4	6/10	28
27B	Q8_0	bf16	5/10	258.4s	16.5	2/10	10/10	0.9/4	1/10	27
35B-A3B	UD-Q4_K_XL	f16	3/10	353.6s	14.7	1/10	6/10	1.2/4	3/10	27
35B-A3B	UD-Q4_K_XL	bf16	3/10	356.2s	17.2	1/10	8/10	1.6/4	4/10	27
35B-A3B	Q8_0	f16	2/10	372.1s	17.6	1/10	7/10	1.2/4	3/10	26
35B-A3B	Q8_0	bf16	6/10	267.7s	17.5	1/10	8/10	2.4/4	6/10	27

Key Takeaways

27B UD-Q4_K_XL + bf16 KV is the sweet spot -- 6/10 all-4/4 accuracy, 0 XML leaks, 131s average, 9/10 reliability.
Larger models use tools far more effectively. 27B/35B averaged 14-17 tool calls per query with multiple URL fetches, while 4B/9B averaged only 2-4 tool calls.
XML leaks are eliminated for small models (0/10) and near-zero for large models (0-2/10).
9B was surprisingly weaker than 4B on this task. It rarely extracted data from fetched pages. May need different prompting for complex multi-step retrieval.
35B-A3B had reliability issues -- most runs timed out due to slow per-token generation across many agentic iterations. When it completed, accuracy matched 27B.
bf16 KV cache helped 27B significantly -- 131s vs 202s, and 6/10 vs 5/10 accuracy. For other sizes the effect was mixed.

Full benchmark results are in studio/backend/tests/tool_calling_benchmark_results.md.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8f60cea2f5

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-03-30T21:43:33Z

+                        cumulative = _TOOL_XML_RE.sub("", cumulative)
                        new_text = cumulative[len(prev_text) :]
                        prev_text = cumulative


Handle shrinking cumulative text after XML redaction

gguf_tool_stream computes deltas by slicing cumulative[len(prev_text):], but cumulative is now rewritten by _TOOL_XML_RE.sub(...) first. When a later chunk closes a previously streamed <tool_call>...</tool_call> span, the redacted cumulative text becomes shorter than prev_text, so the slice offset is wrong and subsequent user-visible text is dropped or garbled instead of streamed correctly. This shows up precisely in the XML-leak scenario this block is trying to recover from, and can also affect legitimate outputs that include that tag pattern.

Useful? React with 👍 / 👎.

When the model repeats the exact same tool call (same name + arguments) twice in a row, skip execution and return a redirect message telling it to try a different approach. This prevents the 8x-repeated-query loops observed on 27B and 35B models. When the tool iteration cap (25) is reached, inject a "provide your final answer now" message before the final streaming pass. This lets the model synthesize a useful answer from everything it gathered instead of being silently cut off. Tested on Qwen3.5-27B UD-Q4_K_XL (10 runs): - Repeated query runs: 4/10 -> 2/10 - Cap hits: 1/10 -> 0/10 - All 4/4 accuracy: 5/10 -> 7/10

for more information, see https://pre-commit.ci

chatgpt-codex-connector · 2026-03-31T07:53:56Z

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Repo admins can enable using credits for code reviews in their settings.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cf9deae02c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-03-31T07:57:00Z

+        conversation.append(
+            {
+                "role": "user",
+                "content": (
+                    "You have used all available tool calls. Based on "


Gate tool-cap nudge on actual cap exhaustion

This block always appends a synthetic user message after the loop, even when max_tool_iterations is 0 (documented as “disabled” via max_tool_calls_per_message). In that case no tool call happened, but the model still receives “You have used all available tool calls…”, which injects false context and can alter the final answer path for users intentionally disabling tools. Append this nudge only when the loop actually consumed the configured cap.

Useful? React with 👍 / 👎.

The regex fallback for HTML stripping did not match closing tags with whitespace before the angle bracket (e.g. </script >). Use \s* before > in both script and style patterns.

- SSRF: resolve hostname via getaddrinfo and reject private, loopback, link-local, multicast, and reserved addresses before fetching - Timeout: handle timeout=None (unlimited mode) in URL fetch path by defaulting to 60s instead of crashing on min(None, 60) - Download cap: read at most max_chars*4+1 bytes instead of the full response body before truncating - XML regex: match both <tool_call> and <function=...> markup in the history/stream cleanup (inference.py) - CodeQL: use [^>]* in closing script/style tags to handle any whitespace or attributes before > - Dedup: track whether each tool call failed so retries after transient errors are allowed; only block consecutive identical calls that both succeeded - Final-answer synthesis: guard on max_tool_iterations > 0 so callers who disable tools do not get a false "used all calls" turn

for more information, see https://pre-commit.ci

chatgpt-codex-connector · 2026-03-31T08:21:10Z

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Repo admins can enable using credits for code reviews in their settings.

- SSRF redirect bypass: disable auto-redirect in urllib, manually follow up to 5 hops with host validation at each step. Prevents public URLs from redirecting to loopback/private targets. - SSE streaming: track prev_text on the raw cumulative and strip XML from the delta only, so completed tool_call tags do not cause the cumulative to shrink and drop trailing real text. - Dedup off-by-one: check the immediately previous call (window=1) instead of requiring 2 matching history entries, so the second identical successful call is blocked rather than the third.

for more information, see https://pre-commit.ci

chatgpt-codex-connector · 2026-03-31T08:38:23Z

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Repo admins can enable using credits for code reviews in their settings.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fec5f2cb00

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-03-31T08:42:47Z

+            def redirect_request(self, req, fp, code, msg, headers, newurl):
+                return None


Handle HTTP redirects without triggering urllib errors

Returning None from _NoRedirect.redirect_request causes urllib to treat 3xx responses as HTTPError via the default error handler, so _fetch_page_text never reaches the if resp.status in (301, ...) redirect branch. In practice, URLs that rely on normal redirects (for example http→https or canonical-path redirects) will fail with Failed to fetch URL instead of being followed through the intended validated-hop loop.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-03-31T08:42:47Z

+                current_url,
+                headers = {"User-Agent": "UnslothStudio/1.0"},
+            )
+            resp = opener.open(req, timeout = timeout)


Prevent DNS rebinding between host check and fetch

The private-range check is done on a preflight DNS lookup, but the subsequent opener.open(req, ...) call performs a fresh resolution when connecting. That leaves a DNS-rebinding window where an attacker-controlled hostname can resolve to a public IP during validation and then to a private/loopback IP during the actual request, bypassing the SSRF protection added here.

Useful? React with 👍 / 👎.

- Redirect fix: urllib raises HTTPError (not a normal response) when the redirect handler returns None. Catch HTTPError for 3xx codes and extract the Location header from the exception object. - Error prefixes: remove overly broad "No " prefix that matched "No results found." (a valid empty-search outcome, not an error). Replace with specific prefixes like "Blocked:", "No query provided", "Failed to resolve". This ensures empty search results are correctly classified as non-errors for duplicate-call tracking.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c80d30f4fc

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-03-31T08:58:43Z

+            for _msg in gguf_messages:
+                if _msg.get("role") == "assistant" and isinstance(
+                    _msg.get("content"), str
+                ):
+                    _msg["content"] = _TOOL_XML_RE.sub("", _msg["content"]).strip()


Respect auto-heal setting when stripping tool-call XML

This sanitization runs unconditionally in the tool path, so even when auto_heal_tool_calls is false the server still removes <tool_call>...</tool_call> / <function=...>...</function> content from assistant messages (and the same regex is later applied to streamed deltas). That breaks the advertised opt-out behavior and can silently corrupt legitimate outputs that include those tags as literal text, while also altering conversation history before the next model step.

Useful? React with 👍 / 👎.

danielhanchen · 2026-03-31T09:10:44Z

/gemini review

gemini-code-assist

Code Review

This pull request enhances tool-calling capabilities by increasing the iteration limit, adding duplicate call detection, and extending the web search tool to support direct URL fetching with SSRF protection. It also introduces system prompt nudges to guide model behavior, implements logic to strip leaked tool-call XML, and includes comprehensive benchmark results for GGUF tool calling. Review feedback identifies several optimization opportunities, including moving imports and regex compilations to the module level, removing an unused variable, and refactoring redundant logic to use existing helper functions.

gemini-code-assist · 2026-03-31T09:14:21Z

+        # where the model repeats the exact same call.  Retries after
+        # a transient failure are allowed (only block when the previous
+        # identical call succeeded).
+        import hashlib as _hl


Importing hashlib inside the function on every call is inefficient. This import should be moved to the top of the file to follow standard Python practices and improve performance.

References

Imports should be at the top of the file, after any module comments and docstrings, and before module globals and constants. ^(link)

gemini-code-assist · 2026-03-31T09:14:22Z

+            key = _tool_call_key(name, args)
+            _tool_call_history.append((key, failed))
+
+        _hit_tool_cap = False


The variable _hit_tool_cap is initialized but never used within the function. It should be removed to maintain code cleanliness.

gemini-code-assist · 2026-03-31T09:14:22Z

+                            content_accum = re.sub(
+                                r"<tool_call>.*?</tool_call>",
+                                "",
+                                content_accum,
+                                flags = re.DOTALL,
+                            ).strip()


This manual regex substitution is redundant and less comprehensive than the _strip_tool_markup helper function defined earlier (line 2160). Using the helper ensures that both \u003ctool_call\u003e and \u003cfunction=...\u003e tags are handled consistently, including cases where tags might be unclosed at the end of the stream. Additionally, _strip_tool_markup respects the auto_heal_tool_calls setting.

content_accum = _strip_tool_markup(content_accum, final = True)

gemini-code-assist · 2026-03-31T09:14:22Z

+            _has_web = "web_search" in _tool_names
+            _has_code = "python" in _tool_names or "terminal" in _tool_names
+
+            from datetime import date as _date


Importing date inside the request handler on every call is inefficient. This import should be moved to the top of the file.

References

Imports should be at the top of the file, after any module comments and docstrings, and before module globals and constants. ^(link)

gemini-code-assist · 2026-03-31T09:14:22Z

+            _TOOL_XML_RE = _re.compile(
+                r"<tool_call>.*?</tool_call>|<function=\w+>.*?</function>",
+                _re.DOTALL,
+            )


Compiling the _TOOL_XML_RE regular expression inside the request handler on every call is inefficient. It should be defined as a module-level constant to improve performance, especially given that this regex is also used within the streaming generator.

- SSE streaming: sanitize the full cumulative text before diffing against the previous sanitized snapshot, so XML tags that span chunk boundaries are stripped correctly. The previous delta-based approach leaked split tags. - DRAINING fallback: use _strip_tool_markup() helper instead of a manual regex that only handled <tool_call> but not <function=...>. - Move hashlib import, _TOOL_XML_RE compile, and datetime import to module level per style guide. - Remove unused _hit_tool_cap variable.

danielhanchen · 2026-03-31T09:41:40Z

/gemini review

gemini-code-assist

Code Review

This pull request enhances tool-calling capabilities by increasing the maximum tool iteration limit to 25, implementing duplicate tool-call detection using hashing, and adding a direct URL fetching feature to the web search tool with SSRF protection. It also introduces system prompt nudges to guide the model's tool usage, adds logic to sanitize leaked tool-call XML from the output, and includes comprehensive benchmark results. Feedback focuses on addressing a potential DNS rebinding vulnerability in the SSRF logic, improving the robustness of duplicate call detection to cover cycles, handling character encoding and HTTP errors more gracefully during page fetches, and refining the HTML-to-text conversion fallback.

gemini-code-assist · 2026-03-31T09:44:25Z

+    ok, reason = _is_public_host(
+        parsed.hostname,
+        parsed.port or (443 if parsed.scheme == "https" else 80),
+    )
+    if not ok:
+        return reason
+
+    try:
+        import urllib.request
+        from urllib.error import HTTPError as _HTTPError
+        from urllib.parse import urljoin
+
+        # Disable auto-redirect so we can validate each hop for SSRF.
+        # urllib raises HTTPError for 3xx when the handler returns None,
+        # so we catch that and extract the Location header manually.
+        class _NoRedirect(urllib.request.HTTPRedirectHandler):
+            def redirect_request(self, req, fp, code, msg, headers, newurl):
+                return None
+
+        opener = urllib.request.build_opener(_NoRedirect)
+        max_bytes = max_chars * 4 + 1
+        current_url = url
+
+        for _hop in range(5):
+            req = urllib.request.Request(
+                current_url,
+                headers = {"User-Agent": "UnslothStudio/1.0"},
+            )
+            try:
+                resp = opener.open(req, timeout = timeout)


The SSRF protection here is vulnerable to a DNS rebinding attack. _is_public_host resolves the hostname to validate the IP, but opener.open(req) performs its own resolution. A malicious actor could use a DNS server that returns a safe IP for the first resolution and a private/internal IP for the second. To mitigate this, you should resolve the hostname once, validate the IP, and then use that IP address in the request URL while setting the Host header manually. When resolving the hostname, ensure you use socket.getaddrinfo() to support both IPv4 and IPv6 addresses.

References

Use socket.getaddrinfo() to create sockets that support both IPv4 and IPv6 addresses, instead of hardcoding an address family like socket.AF_INET.

gemini-code-assist · 2026-03-31T09:44:25Z

+        def _is_duplicate_call(name: str, args: dict) -> bool:
+            """Block if the immediately previous call was identical and succeeded."""
+            if not _tool_call_history:
+                return False
+            key = _tool_call_key(name, args)
+            last_key, last_failed = _tool_call_history[-1]
+            return last_key == key and not last_failed


The current duplicate call detection only checks the immediately preceding call. Models can sometimes enter cycles involving multiple different tool calls (e.g., A -> B -> A -> B). It would be more robust to check if the current call has appeared anywhere in the history of the current generation turn.

Suggested change

def _is_duplicate_call(name: str, args: dict) -> bool:

"""Block if the immediately previous call was identical and succeeded."""

if not _tool_call_history:

return False

key = _tool_call_key(name, args)

last_key, last_failed = _tool_call_history[-1]

return last_key == key and not last_failed

def _is_duplicate_call(name: str, args: dict) -> bool:

"""Block if this exact call has already succeeded in the current turn."""

if not _tool_call_history:

return False

key = _tool_call_key(name, args)

return any(h_key == key and not h_failed for h_key, h_failed in _tool_call_history)

gemini-code-assist · 2026-03-31T09:44:25Z

+                            "process data you already have, or "
+                            "provide your final answer now."
+                        )
+                        _record_tool_call(tool_name, arguments, failed = False)


This call to _record_tool_call is redundant because the tool call is recorded again at the end of the loop (line 2738) regardless of whether it was a duplicate or a fresh execution. Removing this redundant call keeps the history clean and avoids double-counting attempts.

gemini-code-assist · 2026-03-31T09:44:25Z

+        else:
+            return "Failed to fetch URL: too many redirects."
+
+        raw_html = raw_bytes.decode("utf-8", errors = "replace")


Hardcoding utf-8 for decoding the HTML body may lead to corrupted text for pages using other encodings (e.g., ISO-8859-1). It is better to use the charset provided in the response headers.

Suggested change

raw_html = raw_bytes.decode("utf-8", errors = "replace")

charset = resp.info().get_content_charset() or "utf-8"

raw_html = raw_bytes.decode(charset, errors = "replace")

gemini-code-assist · 2026-03-31T09:44:26Z

+    except _HTTPError:
+        raise


Re-raising _HTTPError here causes the error to be caught by the caller's broad except Exception block in _web_search, resulting in a generic "Search failed" message. It would be better to catch it here, log the exception for debugging purposes, and return a more descriptive error message (including the status code and reason) so the model can understand why the fetch failed.

Suggested change

except _HTTPError:

raise

except _HTTPError as e:

logger.debug(f"HTTP error fetching URL: {e}")

return f"Failed to fetch URL: HTTP {e.code} {e.reason}"

References

Avoid using broad, silent exception handlers like except Exception: pass. Instead, log the exception, even if at a debug level, to aid in future debugging.

gemini-code-assist · 2026-03-31T09:44:26Z

+        text = _re.sub(
+            r"<script[^>]*>.*?</script[^>]*>",
+            "",
+            raw_html,
+            flags = _re.DOTALL | _re.IGNORECASE,
+        )
+        text = _re.sub(
+            r"<style[^>]*>.*?</style[^>]*>", "", text, flags = _re.DOTALL | _re.IGNORECASE
+        )
+        text = _re.sub(r"<[^>]+>", " ", text)
+        text = _re.sub(r"\s+", " ", text).strip()


Using regular expressions to strip HTML is fragile and can be easily bypassed or produce poor results with complex or malformed HTML. While this is a fallback, consider making html2text a required dependency or using a more robust library like beautifulsoup4 if available in the environment.

…e-record - DNS rebinding: resolve hostname once via getaddrinfo, pin the returned IP, rewrite the URL to connect to the pinned IP with a Host header. Each redirect hop re-resolves and re-validates. Closes the TOCTOU window between validation and connection. - Charset: use resp.headers.get_content_charset() instead of hardcoding utf-8, so pages with other encodings decode correctly. - HTTPError: return descriptive "HTTP {code} {reason}" instead of re-raising into a generic "Search failed" message. - Dedup: remove redundant _record_tool_call in the duplicate branch; the single call at the end of the loop handles all cases.

for more information, see https://pre-commit.ci

chatgpt-codex-connector · 2026-03-31T09:52:42Z

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Repo admins can enable using credits for code reviews in their settings.

danielhanchen · 2026-03-31T09:55:41Z

/gemini review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d41b483688

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-03-31T09:58:36Z

+            ip_netloc = f"{pinned_ip}:{cp.port}" if cp.port else pinned_ip
+            pinned_url = urlunparse(cp._replace(netloc = ip_netloc))


Keep hostname in URL when pinning fetch target IP

Rewriting current_url to use pinned_ip in the URL netloc breaks many valid fetches: for HTTPS, urllib's SNI/certificate hostname checks use the URL host (the IP), not the Host header, so domain certificates no longer match; and if pinned_ip is IPv6, the unbracketed netloc produces an invalid URL parse. This means web_search URL-fetch mode can fail for common dual-stack HTTPS sites.

Useful? React with 👍 / 👎.

gemini-code-assist

Code Review

This pull request enhances the tool-calling framework by increasing the maximum iteration cap to 25 and implementing duplicate call detection to prevent infinite loops. It introduces a direct URL fetching feature to the web search tool, accompanied by SSRF protections such as DNS pinning and private IP validation. The update also includes logic to strip leaked tool-call XML from assistant responses, adds system prompt nudges to guide model behavior, and provides benchmark results showing improved performance. I have no feedback to provide.

…#4700) * studio: improve GGUF tool calling accuracy and reliability - Add URL fetching to web_search tool so models can read full page content instead of only getting search snippets. Uses html2text for clean markdown conversion with regex fallback. - Inject current date and behavioral guidance (URL fetch workflow, no repeated queries, use code for data processing) into the tool-use system prompt. - Append error recovery nudge to tool results that indicate failure, helping small models avoid looping on the same broken call. - Strip leaked <tool_call> XML from assistant messages in conversation history and from the outgoing SSE stream. - Raise default max tool iterations from 10 to 25 across backend, model schema, and frontend defaults. - Increase _MAX_PAGE_CHARS from 4k to 16k so fetched pages contain enough content for the model to extract useful information. - Add "IMPORTANT: These are only short snippets" hint to search results so models know to fetch full pages when needed. Tested with Qwen3.5-4B-GGUF (UD-Q4_K_XL), 10 runs before/after: - XML leaks in responses: 10/10 -> 0/10 - URL fetch usage: 0 -> 4/10 runs - Runs producing actual correct answers: 0/10 -> 2/10 - Average tool calls per query: 5.5 -> 3.8 (more efficient) - Average response time: 12.3s -> 9.8s * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add tool calling benchmark results across model sizes and quants Tested 16 configurations (4 models x 2 quants x 2 KV cache types) with 10 runs each on NVIDIA B200. Best config: 27B UD-Q4_K_XL + bf16 KV -- 6/10 runs found all 4 correct songs, 0 XML leaks, 131s average response time. * Add duplicate tool-call detection and final-answer synthesis When the model repeats the exact same tool call (same name + arguments) twice in a row, skip execution and return a redirect message telling it to try a different approach. This prevents the 8x-repeated-query loops observed on 27B and 35B models. When the tool iteration cap (25) is reached, inject a "provide your final answer now" message before the final streaming pass. This lets the model synthesize a useful answer from everything it gathered instead of being silently cut off. Tested on Qwen3.5-27B UD-Q4_K_XL (10 runs): - Repeated query runs: 4/10 -> 2/10 - Cap hits: 1/10 -> 0/10 - All 4/4 accuracy: 5/10 -> 7/10 * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix CodeQL alert: handle whitespace in script/style closing tags The regex fallback for HTML stripping did not match closing tags with whitespace before the angle bracket (e.g. </script >). Use \s* before > in both script and style patterns. * Address reviewer findings: SSRF, timeout crash, XML regex, dedup - SSRF: resolve hostname via getaddrinfo and reject private, loopback, link-local, multicast, and reserved addresses before fetching - Timeout: handle timeout=None (unlimited mode) in URL fetch path by defaulting to 60s instead of crashing on min(None, 60) - Download cap: read at most max_chars*4+1 bytes instead of the full response body before truncating - XML regex: match both <tool_call> and <function=...> markup in the history/stream cleanup (inference.py) - CodeQL: use [^>]* in closing script/style tags to handle any whitespace or attributes before > - Dedup: track whether each tool call failed so retries after transient errors are allowed; only block consecutive identical calls that both succeeded - Final-answer synthesis: guard on max_tool_iterations > 0 so callers who disable tools do not get a false "used all calls" turn * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix redirect SSRF, SSE streaming regression, dedup off-by-one - SSRF redirect bypass: disable auto-redirect in urllib, manually follow up to 5 hops with host validation at each step. Prevents public URLs from redirecting to loopback/private targets. - SSE streaming: track prev_text on the raw cumulative and strip XML from the delta only, so completed tool_call tags do not cause the cumulative to shrink and drop trailing real text. - Dedup off-by-one: check the immediately previous call (window=1) instead of requiring 2 matching history entries, so the second identical successful call is blocked rather than the third. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix redirect HTTPError handling and tighten error prefixes - Redirect fix: urllib raises HTTPError (not a normal response) when the redirect handler returns None. Catch HTTPError for 3xx codes and extract the Location header from the exception object. - Error prefixes: remove overly broad "No " prefix that matched "No results found." (a valid empty-search outcome, not an error). Replace with specific prefixes like "Blocked:", "No query provided", "Failed to resolve". This ensures empty search results are correctly classified as non-errors for duplicate-call tracking. * Fix SSE cross-chunk XML leaks, cleanup review findings - SSE streaming: sanitize the full cumulative text before diffing against the previous sanitized snapshot, so XML tags that span chunk boundaries are stripped correctly. The previous delta-based approach leaked split tags. - DRAINING fallback: use _strip_tool_markup() helper instead of a manual regex that only handled <tool_call> but not <function=...>. - Move hashlib import, _TOOL_XML_RE compile, and datetime import to module level per style guide. - Remove unused _hit_tool_cap variable. * Fix DNS rebinding, charset detection, HTTPError handling, dedup double-record - DNS rebinding: resolve hostname once via getaddrinfo, pin the returned IP, rewrite the URL to connect to the pinned IP with a Host header. Each redirect hop re-resolves and re-validates. Closes the TOCTOU window between validation and connection. - Charset: use resp.headers.get_content_charset() instead of hardcoding utf-8, so pages with other encodings decode correctly. - HTTPError: return descriptive "HTTP {code} {reason}" instead of re-raising into a generic "Search failed" message. - Dedup: remove redundant _record_tool_call in the duplicate branch; the single call at the end of the loop handles all cases. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

danielhanchen requested review from Manan17, rolandtannous and wasimysaid as code owners March 30, 2026 13:39

[pre-commit.ci] auto fixes from pre-commit.com hooks

68546d7

for more information, see https://pre-commit.ci

github-advanced-security AI found potential problems Mar 30, 2026

View reviewed changes

Comment thread studio/backend/core/inference/tools.py Fixed

chatgpt-codex-connector Bot reviewed Mar 30, 2026

View reviewed changes

gemini-code-assist Bot reviewed Mar 30, 2026

View reviewed changes

Add tool calling benchmark results across model sizes and quants

8f60cea

Tested 16 configurations (4 models x 2 quants x 2 KV cache types) with 10 runs each on NVIDIA B200. Best config: 27B UD-Q4_K_XL + bf16 KV -- 6/10 runs found all 4 correct songs, 0 XML leaks, 131s average response time.

chatgpt-codex-connector Bot reviewed Mar 30, 2026

View reviewed changes

danielhanchen and others added 2 commits March 31, 2026 07:53

[pre-commit.ci] auto fixes from pre-commit.com hooks

cf9deae

for more information, see https://pre-commit.ci

chatgpt-codex-connector Bot reviewed Mar 31, 2026

View reviewed changes

Fix CodeQL alert: handle whitespace in script/style closing tags

ba52261

The regex fallback for HTML stripping did not match closing tags with whitespace before the angle bracket (e.g. </script >). Use \s* before > in both script and style patterns.

github-advanced-security AI found potential problems Mar 31, 2026

View reviewed changes

Comment thread studio/backend/core/inference/tools.py Fixed

danielhanchen and others added 2 commits March 31, 2026 08:20

[pre-commit.ci] auto fixes from pre-commit.com hooks

6abf067

for more information, see https://pre-commit.ci

danielhanchen and others added 3 commits March 31, 2026 01:22

Merge branch 'main' into studio/improve-gguf-tool-calling

529f793

[pre-commit.ci] auto fixes from pre-commit.com hooks

1d184bf

for more information, see https://pre-commit.ci

chatgpt-codex-connector Bot reviewed Mar 31, 2026

View reviewed changes

gemini-code-assist Bot reviewed Mar 31, 2026

View reviewed changes

danielhanchen and others added 2 commits March 31, 2026 09:52

[pre-commit.ci] auto fixes from pre-commit.com hooks

d41b483

for more information, see https://pre-commit.ci

chatgpt-codex-connector Bot reviewed Mar 31, 2026

View reviewed changes

gemini-code-assist Bot reviewed Mar 31, 2026

View reviewed changes

danielhanchen merged commit e159b93 into main Mar 31, 2026
5 checks passed

danielhanchen deleted the studio/improve-gguf-tool-calling branch March 31, 2026 10:06

danielhanchen mentioned this pull request Mar 31, 2026

Studio: simplify tool-call dedup and replace html2text with builtin converter #4722

Merged

3 tasks

This was referenced May 21, 2026

studio: agentic loop dead-ends with context-overflow error when many tool calls accumulate #5562

Open

Studio: don't re-prompt after model produced a complete answer #5714

Open

		with urllib.request.urlopen(req, timeout = timeout) as resp:
		raw_html = resp.read().decode("utf-8", errors = "replace")

		_MAX_PAGE_CHARS = 16000 # limit fetched page text


		def _fetch_page_text(

		def redirect_request(self, req, fp, code, msg, headers, newurl):
		return None

	raw_html = raw_bytes.decode("utf-8", errors = "replace")
	charset = resp.info().get_content_charset() or "utf-8"
	raw_html = raw_bytes.decode(charset, errors = "replace")

-    except _HTTPError:
-        raise
+    except _HTTPError as e:
+        logger.debug(f"HTTP error fetching URL: {e}")
+        return f"Failed to fetch URL: HTTP {e.code} {e.reason}"

		ip_netloc = f"{pinned_ip}:{cp.port}" if cp.port else pinned_ip
		pinned_url = urlunparse(cp._replace(netloc = ip_netloc))

Uh oh!

Conversation

danielhanchen commented Mar 30, 2026

Summary

Changes

Test Results

Files Changed

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Mar 30, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Mar 30, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Mar 30, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Mar 30, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Mar 30, 2026

Choose a reason for hiding this comment

Uh oh!

danielhanchen commented Mar 30, 2026

Benchmark Results: Full Cartesian Grid

Key Takeaways

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Mar 30, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot commented Mar 31, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Mar 31, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

chatgpt-codex-connector Bot commented Mar 31, 2026

Uh oh!

chatgpt-codex-connector Bot commented Mar 31, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Mar 31, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Mar 31, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Mar 31, 2026

Choose a reason for hiding this comment

Uh oh!

danielhanchen commented Mar 31, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!