studio: improve GGUF tool calling accuracy and reliability #4700
- Add URL fetching to web_search tool so models can read full page content instead of only getting search snippets. Uses html2text for clean markdown conversion with regex fallback.
- Inject current date and behavioral guidance (URL fetch workflow, no repeated queries, use code for data processing) into the tool-use system prompt.
- Append error recovery nudge to tool results that indicate failure, helping small models avoid looping on the same broken call.
- Strip leaked <tool_call> XML from assistant messages in conversation history and from the outgoing SSE stream.
- Raise default max tool iterations from 10 to 25 across backend, model schema, and frontend defaults.
- Increase _MAX_PAGE_CHARS from 4k to 16k so fetched pages contain enough content for the model to extract useful information.
- Add "IMPORTANT: These are only short snippets" hint to search results so models know to fetch full pages when needed.

Tested with Qwen3.5-4B-GGUF (UD-Q4_K_XL), 10 runs before/after:
- XML leaks in responses: 10/10 -> 0/10
- URL fetch usage: 0 -> 4/10 runs
- Runs producing actual correct answers: 0/10 -> 2/10
- Average tool calls per query: 5.5 -> 3.8 (more efficient)
- Average response time: 12.3s -> 9.8s
for more information, see https://pre-commit.ci
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 68546d7aaf
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return f"Blocked: only http/https URLs are allowed (got {parsed.scheme!r})."
Block internal hosts in URL fetcher
_fetch_page_text only validates the URL scheme, so web_search can fetch http(s) targets on loopback, link-local, or private networks (for example 127.0.0.1, localhost, 169.254.169.254, RFC1918 ranges). Because tool arguments come from model output, this introduces an SSRF path that can exfiltrate internal metadata/services in production deployments; host/IP allowlisting or private-range blocking is needed in addition to scheme checks.
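A minimal sketch of the private-range blocking this comment asks for, assuming the check runs before any request is sent. The helper names `is_public_ip` and `host_is_public` are hypothetical and are not taken from the PR; only `ipaddress` and `socket.getaddrinfo` from the standard library are used.

```python
import ipaddress
import socket


def is_public_ip(ip_text: str) -> bool:
    """Return True only for addresses outside the ranges the review lists."""
    ip = ipaddress.ip_address(ip_text)
    return not (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified
    )


def host_is_public(hostname: str, port: int) -> tuple[bool, str]:
    """Resolve hostname and require every returned address to be public.

    Hypothetical helper: returns (ok, reason) so the caller can hand the
    reason string back to the model as the tool result.
    """
    try:
        infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        return False, f"Failed to resolve {hostname!r}: {exc}"
    for info in infos:
        ip_text = info[4][0]  # sockaddr tuple starts with the address string
        if not is_public_ip(ip_text):
            return False, f"Blocked: {hostname!r} resolves to non-public address {ip_text}."
    return True, ""
```

Checking every address returned by `getaddrinfo` (rather than only the first) avoids accepting a hostname that mixes public and private records. This sketch does not by itself close the DNS-rebinding gap raised later in the thread.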
    """
    # Direct URL fetch mode
    if url and url.strip():
        return _fetch_page_text(url.strip(), timeout = min(timeout, 60))
Handle unlimited timeout in URL mode
When users set tool_call_timeout=9999 (documented as no limit), the caller passes timeout=None; this branch then executes min(timeout, 60), which raises TypeError for URL fetches before any network call. That means web_search with a url argument crashes the tool path under the documented unlimited-timeout setting.
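One way to avoid the `min(None, 60)` crash is to normalize the timeout before it reaches the fetcher. The sketch below is illustrative only; `fetch_timeout` and `MAX_FETCH_TIMEOUT` are hypothetical names, and the 60-second cap is taken from the snippet above.

```python
from typing import Optional

MAX_FETCH_TIMEOUT = 60  # seconds; hypothetical constant mirroring the literal 60 above


def fetch_timeout(timeout: Optional[float], cap: float = MAX_FETCH_TIMEOUT) -> float:
    """Clamp the caller's timeout to cap, treating None (no limit) as cap."""
    if timeout is None:
        return cap
    return min(timeout, cap)
```

With this helper the URL branch could call `_fetch_page_text(url.strip(), timeout = fetch_timeout(timeout))`, so the unlimited setting still gets a bounded network wait.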
    with urllib.request.urlopen(req, timeout = timeout) as resp:
        raw_html = resp.read().decode("utf-8", errors = "replace")
Enforce response size before reading fetched pages
The function truncates text to _MAX_PAGE_CHARS only after resp.read() has already loaded the entire body into memory. Large or binary URLs can therefore consume excessive memory/CPU and destabilize inference workers despite the apparent page-size cap; read limits should be applied during download, not post-hoc.
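A sketch of applying the limit during download instead of afterwards. The byte budget of `max_chars * 4 + 1` follows the figure quoted in a later commit message on this PR; `read_text_capped` is a hypothetical name, and `resp` is assumed to be any object with a file-like `read(n)` method, such as the response returned by `urllib.request.urlopen`.

```python
import io


def read_text_capped(resp, max_chars: int, encoding: str = "utf-8") -> str:
    """Read at most max_chars * 4 + 1 bytes, then decode and truncate.

    Four bytes is the longest UTF-8 encoding of one code point, so this
    budget is enough to recover max_chars characters of UTF-8 text.
    """
    max_bytes = max_chars * 4 + 1
    raw = resp.read(max_bytes)  # never pulls the whole body into memory
    return raw.decode(encoding, errors="replace")[:max_chars]


# Usage with an in-memory stream standing in for an HTTP response.
_demo = io.BytesIO(b"a" * 100)
_text = read_text_capped(_demo, 10)
```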
Code Review
This pull request enhances the tool-use framework by adding a direct URL fetching capability to the web_search tool, increasing the default maximum tool iterations from 10 to 25, and implementing system prompt nudges to improve model performance. It also includes logic to sanitize conversation history and output streams by stripping leaked tool-call XML tags. Feedback was provided regarding the consistency of timeout values in the fetching utility and the placement of imports according to PEP 8 guidelines.
    _MAX_PAGE_CHARS = 16000  # limit fetched page text


    def _fetch_page_text(
The _fetch_page_text function has a default timeout of 30 seconds. However, when called from _web_search, the timeout is explicitly set to min(timeout, 60). This means the effective timeout for _fetch_page_text will always be capped at 60 seconds, and its own default of 30 seconds will only apply if it's called directly, not through _web_search.
To improve clarity and avoid potential confusion, consider making the default timeout for _fetch_page_text consistent with its intended maximum usage (e.g., 60 seconds if that's the hard limit for web fetches) or add a constant for this maximum timeout.
    _has_web = "web_search" in _tool_names
    _has_code = "python" in _tool_names or "terminal" in _tool_names

    from datetime import date as _date
The import statement from datetime import date as _date is placed inside the openai_chat_completions function. It is generally considered best practice to place all standard library imports at the top of the file, outside of any function definitions. This improves readability and ensures that dependencies are declared clearly at the module level.
References
- PEP 8: Imports are always put at the top of the file, just after any module comments and docstrings, and before module globals and constants.
Tested 16 configurations (4 models x 2 quants x 2 KV cache types) with 10 runs each on NVIDIA B200. Best config: 27B UD-Q4_K_XL + bf16 KV -- 6/10 runs found all 4 correct songs, 0 XML leaks, 131s average response time.
Benchmark Results: Full Cartesian Grid
Tested 16 configurations across 4 model sizes, 2 quantizations, and 2 KV cache types. 10 runs per config, NVIDIA B200. Prompt: "List and categorize all the songs that charted #3 on the Billboard Hot 100 in 2015." Ground truth: 4 songs peaked at #3 -- "Love Me like You Do", "Earned It", "Watch Me", "Drag Me Down".
Key Takeaways
Full benchmark results are in |
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8f60cea2f5
    cumulative = _TOOL_XML_RE.sub("", cumulative)
    new_text = cumulative[len(prev_text) :]
    prev_text = cumulative
Handle shrinking cumulative text after XML redaction
gguf_tool_stream computes deltas by slicing cumulative[len(prev_text):], but cumulative is now rewritten by _TOOL_XML_RE.sub(...) first. When a later chunk closes a previously streamed <tool_call>...</tool_call> span, the redacted cumulative text becomes shorter than prev_text, so the slice offset is wrong and subsequent user-visible text is dropped or garbled instead of streamed correctly. This shows up precisely in the XML-leak scenario this block is trying to recover from, and can also affect legitimate outputs that include that tag pattern.
When the model repeats the exact same tool call (same name + arguments) twice in a row, skip execution and return a redirect message telling it to try a different approach. This prevents the 8x-repeated-query loops observed on 27B and 35B models.

When the tool iteration cap (25) is reached, inject a "provide your final answer now" message before the final streaming pass. This lets the model synthesize a useful answer from everything it gathered instead of being silently cut off.

Tested on Qwen3.5-27B UD-Q4_K_XL (10 runs):
- Repeated query runs: 4/10 -> 2/10
- Cap hits: 1/10 -> 0/10
- All 4/4 accuracy: 5/10 -> 7/10
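The duplicate check described above could look roughly like the following. This is a sketch under assumptions: the key is a SHA-256 digest of the tool name plus JSON-serialized arguments, and the history stores a `failed` flag so retries after errors are allowed (a refinement added in a later commit on this PR). The function names are hypothetical and take the history as an explicit parameter to keep the example self-contained.

```python
import hashlib
import json


def tool_call_key(name: str, args: dict) -> str:
    """Stable fingerprint of a tool call; argument order does not matter."""
    payload = json.dumps({"name": name, "args": args}, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def record_tool_call(history: list, name: str, args: dict, failed: bool) -> None:
    """Append (key, failed) for the call that was just handled."""
    history.append((tool_call_key(name, args), failed))


def is_duplicate_call(history: list, name: str, args: dict) -> bool:
    """True if the immediately previous call was identical and succeeded."""
    if not history:
        return False
    last_key, last_failed = history[-1]
    return last_key == tool_call_key(name, args) and not last_failed
```

When `is_duplicate_call` returns True, the loop would skip execution and hand the model a redirect message instead of the tool output.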
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: cf9deae02c
    conversation.append(
        {
            "role": "user",
            "content": (
                "You have used all available tool calls. Based on "
Gate tool-cap nudge on actual cap exhaustion
This block always appends a synthetic user message after the loop, even when max_tool_iterations is 0 (documented as “disabled” via max_tool_calls_per_message). In that case no tool call happened, but the model still receives “You have used all available tool calls…”, which injects false context and can alter the final answer path for users intentionally disabling tools. Append this nudge only when the loop actually consumed the configured cap.
The regex fallback for HTML stripping did not match closing tags with whitespace before the angle bracket (e.g. </script >). Use \s* before > in both script and style patterns.
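To make the fix concrete, here is a small sketch of a regex fallback that tolerates whitespace before the closing angle bracket. It uses the `\s*` form described in this commit message (a later commit in the thread widens it to `[^>]*`); `strip_html` is a hypothetical name, not the function in `tools.py`.

```python
import re

_FLAGS = re.DOTALL | re.IGNORECASE


def strip_html(raw_html: str) -> str:
    """Crude HTML-to-text fallback: drop script/style blocks, then all tags."""
    # \s* before > lets the closing tag match "</script >" or "</style\n>".
    text = re.sub(r"<script[^>]*>.*?</script\s*>", "", raw_html, flags=_FLAGS)
    text = re.sub(r"<style[^>]*>.*?</style\s*>", "", text, flags=_FLAGS)
    text = re.sub(r"<[^>]+>", " ", text)  # replace remaining tags with spaces
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace
```

Without the `\s*`, the input `<script>alert(1)</script >` would keep its script body, because the strict `</script>` pattern never matches the closing tag.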
- SSRF: resolve hostname via getaddrinfo and reject private, loopback, link-local, multicast, and reserved addresses before fetching
- Timeout: handle timeout=None (unlimited mode) in URL fetch path by defaulting to 60s instead of crashing on min(None, 60)
- Download cap: read at most max_chars*4+1 bytes instead of the full response body before truncating
- XML regex: match both <tool_call> and <function=...> markup in the history/stream cleanup (inference.py)
- CodeQL: use [^>]* in closing script/style tags to handle any whitespace or attributes before >
- Dedup: track whether each tool call failed so retries after transient errors are allowed; only block consecutive identical calls that both succeeded
- Final-answer synthesis: guard on max_tool_iterations > 0 so callers who disable tools do not get a false "used all calls" turn
- SSRF redirect bypass: disable auto-redirect in urllib, manually follow up to 5 hops with host validation at each step. Prevents public URLs from redirecting to loopback/private targets.
- SSE streaming: track prev_text on the raw cumulative and strip XML from the delta only, so completed tool_call tags do not cause the cumulative to shrink and drop trailing real text.
- Dedup off-by-one: check the immediately previous call (window=1) instead of requiring 2 matching history entries, so the second identical successful call is blocked rather than the third.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fec5f2cb00
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None
Handle HTTP redirects without triggering urllib errors
Returning None from _NoRedirect.redirect_request causes urllib to treat 3xx responses as HTTPError via the default error handler, so _fetch_page_text never reaches the if resp.status in (301, ...) redirect branch. In practice, URLs that rely on normal redirects (for example http→https or canonical-path redirects) will fail with Failed to fetch URL instead of being followed through the intended validated-hop loop.
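A sketch of how the 3xx case could be recovered from the exception, assuming the opener is built with a redirect handler that returns None as shown above. `redirect_target` and `REDIRECT_CODES` are hypothetical names, and the exact set of status codes is an assumption because the original tuple is elided in the comment.

```python
import email.message
from typing import Optional
from urllib.error import HTTPError
from urllib.parse import urljoin

# Assumed set of redirect status codes; the PR's own tuple is not shown in full.
REDIRECT_CODES = (301, 302, 303, 307, 308)


def redirect_target(current_url: str, err: HTTPError) -> Optional[str]:
    """Return the absolute redirect URL carried by a 3xx HTTPError, else None."""
    if err.code not in REDIRECT_CODES:
        return None
    headers = err.headers
    location = headers.get("Location") if headers is not None else None
    if not location:
        return None
    # Location may be relative, so resolve it against the URL just requested.
    return urljoin(current_url, location)
```

In the hop loop, an `except HTTPError as err` branch could call `redirect_target(current_url, err)`, validate the new host, and continue; a None result would mean a genuine HTTP error to report.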
        current_url,
        headers = {"User-Agent": "UnslothStudio/1.0"},
    )
    resp = opener.open(req, timeout = timeout)
Prevent DNS rebinding between host check and fetch
The private-range check is done on a preflight DNS lookup, but the subsequent opener.open(req, ...) call performs a fresh resolution when connecting. That leaves a DNS-rebinding window where an attacker-controlled hostname can resolve to a public IP during validation and then to a private/loopback IP during the actual request, bypassing the SSRF protection added here.
- Redirect fix: urllib raises HTTPError (not a normal response) when the redirect handler returns None. Catch HTTPError for 3xx codes and extract the Location header from the exception object.
- Error prefixes: remove overly broad "No " prefix that matched "No results found." (a valid empty-search outcome, not an error). Replace with specific prefixes like "Blocked:", "No query provided", "Failed to resolve". This ensures empty search results are correctly classified as non-errors for duplicate-call tracking.
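The prefix tightening can be illustrated with a small classifier. This is a sketch only: `ERROR_PREFIXES` and `is_error_result` are hypothetical names, and the tuple contains just the prefixes quoted in this thread rather than the full list used in the PR.

```python
# Illustrative subset of error prefixes mentioned in this conversation.
ERROR_PREFIXES = (
    "Blocked:",
    "No query provided",
    "Failed to resolve",
    "Failed to fetch URL",
)


def is_error_result(result: str) -> bool:
    """Classify a tool result string as a failure based on specific prefixes."""
    # str.startswith accepts a tuple and returns True if any prefix matches.
    return result.lstrip().startswith(ERROR_PREFIXES)
```

Because the bare "No " prefix is gone, "No results found." is treated as a successful (empty) search, which matters for the duplicate-call tracking described above.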
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c80d30f4fc
    for _msg in gguf_messages:
        if _msg.get("role") == "assistant" and isinstance(
            _msg.get("content"), str
        ):
            _msg["content"] = _TOOL_XML_RE.sub("", _msg["content"]).strip()
Respect auto-heal setting when stripping tool-call XML
This sanitization runs unconditionally in the tool path, so even when auto_heal_tool_calls is false the server still removes <tool_call>...</tool_call> / <function=...>...</function> content from assistant messages (and the same regex is later applied to streamed deltas). That breaks the advertised opt-out behavior and can silently corrupt legitimate outputs that include those tags as literal text, while also altering conversation history before the next model step.
/gemini review
Code Review
This pull request enhances tool-calling capabilities by increasing the iteration limit, adding duplicate call detection, and extending the web search tool to support direct URL fetching with SSRF protection. It also introduces system prompt nudges to guide model behavior, implements logic to strip leaked tool-call XML, and includes comprehensive benchmark results for GGUF tool calling. Review feedback identifies several optimization opportunities, including moving imports and regex compilations to the module level, removing an unused variable, and refactoring redundant logic to use existing helper functions.
    # where the model repeats the exact same call. Retries after
    # a transient failure are allowed (only block when the previous
    # identical call succeeded).
    import hashlib as _hl
Importing hashlib inside the function on every call is inefficient. This import should be moved to the top of the file to follow standard Python practices and improve performance.
References
- Imports should be at the top of the file, after any module comments and docstrings, and before module globals and constants.
    key = _tool_call_key(name, args)
    _tool_call_history.append((key, failed))


    _hit_tool_cap = False

    content_accum = re.sub(
        r"<tool_call>.*?</tool_call>",
        "",
        content_accum,
        flags = re.DOTALL,
    ).strip()
This manual regex substitution is redundant and less comprehensive than the _strip_tool_markup helper function defined earlier (line 2160). Using the helper ensures that both <tool_call> and <function=...> tags are handled consistently, including cases where tags might be unclosed at the end of the stream. Additionally, _strip_tool_markup respects the auto_heal_tool_calls setting.
Suggested change:
    content_accum = _strip_tool_markup(content_accum, final = True)

    _has_web = "web_search" in _tool_names
    _has_code = "python" in _tool_names or "terminal" in _tool_names

    from datetime import date as _date
Importing date inside the request handler on every call is inefficient. This import should be moved to the top of the file.
References
- Imports should be at the top of the file, after any module comments and docstrings, and before module globals and constants.
    _TOOL_XML_RE = _re.compile(
        r"<tool_call>.*?</tool_call>|<function=\w+>.*?</function>",
        _re.DOTALL,
    )
- SSE streaming: sanitize the full cumulative text before diffing against the previous sanitized snapshot, so XML tags that span chunk boundaries are stripped correctly. The previous delta-based approach leaked split tags.
- DRAINING fallback: use _strip_tool_markup() helper instead of a manual regex that only handled <tool_call> but not <function=...>.
- Move hashlib import, _TOOL_XML_RE compile, and datetime import to module level per style guide.
- Remove unused _hit_tool_cap variable.
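The "sanitize the cumulative text, then diff against the previous sanitized snapshot" idea can be sketched as follows. This is an illustration under assumptions, not the code in `inference.py`: it handles only `<tool_call>` markup (not `<function=...>`), and it additionally holds back a trailing partial opening tag so a tag split across chunks is never emitted. All names are hypothetical.

```python
import re
from typing import Iterable, Iterator

_TAG_OPEN = "<tool_call>"
_CLOSED_RE = re.compile(r"<tool_call>.*?</tool_call>", re.DOTALL)
_UNCLOSED_RE = re.compile(r"<tool_call>.*\Z", re.DOTALL)


def _sanitize(raw: str, final: bool = False) -> str:
    """Strip complete and still-open tool_call spans from the cumulative text."""
    text = _CLOSED_RE.sub("", raw)
    text = _UNCLOSED_RE.sub("", text)
    if not final:
        # Hold back a trailing fragment such as "<tool_" that may become a tag.
        for k in range(min(len(_TAG_OPEN) - 1, len(text)), 0, -1):
            if text.endswith(_TAG_OPEN[:k]):
                return text[:-k]
    return text


def stream_visible_text(chunks: Iterable[str]) -> Iterator[str]:
    """Yield only new user-visible text for each incoming raw chunk."""
    raw = ""   # everything the model has produced so far
    sent = ""  # sanitized text already emitted to the client
    for chunk in chunks:
        raw += chunk
        clean = _sanitize(raw)
        if len(clean) > len(sent) and clean.startswith(sent):
            yield clean[len(sent):]
            sent = clean
    clean = _sanitize(raw, final=True)  # flush any held-back tail
    if len(clean) > len(sent) and clean.startswith(sent):
        yield clean[len(sent):]
```

Because the diff is taken between two sanitized snapshots, closing a tag in a later chunk does not shift the slice offset, which is the failure mode the earlier review comment described.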
/gemini review
Code Review
This pull request enhances tool-calling capabilities by increasing the maximum tool iteration limit to 25, implementing duplicate tool-call detection using hashing, and adding a direct URL fetching feature to the web search tool with SSRF protection. It also introduces system prompt nudges to guide the model's tool usage, adds logic to sanitize leaked tool-call XML from the output, and includes comprehensive benchmark results. Feedback focuses on addressing a potential DNS rebinding vulnerability in the SSRF logic, improving the robustness of duplicate call detection to cover cycles, handling character encoding and HTTP errors more gracefully during page fetches, and refining the HTML-to-text conversion fallback.
    ok, reason = _is_public_host(
        parsed.hostname,
        parsed.port or (443 if parsed.scheme == "https" else 80),
    )
    if not ok:
        return reason

    try:
        import urllib.request
        from urllib.error import HTTPError as _HTTPError
        from urllib.parse import urljoin

        # Disable auto-redirect so we can validate each hop for SSRF.
        # urllib raises HTTPError for 3xx when the handler returns None,
        # so we catch that and extract the Location header manually.
        class _NoRedirect(urllib.request.HTTPRedirectHandler):
            def redirect_request(self, req, fp, code, msg, headers, newurl):
                return None

        opener = urllib.request.build_opener(_NoRedirect)
        max_bytes = max_chars * 4 + 1
        current_url = url

        for _hop in range(5):
            req = urllib.request.Request(
                current_url,
                headers = {"User-Agent": "UnslothStudio/1.0"},
            )
            try:
                resp = opener.open(req, timeout = timeout)
The SSRF protection here is vulnerable to a DNS rebinding attack. _is_public_host resolves the hostname to validate the IP, but opener.open(req) performs its own resolution. A malicious actor could use a DNS server that returns a safe IP for the first resolution and a private/internal IP for the second. To mitigate this, you should resolve the hostname once, validate the IP, and then use that IP address in the request URL while setting the Host header manually. When resolving the hostname, ensure you use socket.getaddrinfo() to support both IPv4 and IPv6 addresses.
References
- Use socket.getaddrinfo() to create sockets that support both IPv4 and IPv6 addresses, instead of hardcoding an address family like socket.AF_INET.
    def _is_duplicate_call(name: str, args: dict) -> bool:
        """Block if the immediately previous call was identical and succeeded."""
        if not _tool_call_history:
            return False
        key = _tool_call_key(name, args)
        last_key, last_failed = _tool_call_history[-1]
        return last_key == key and not last_failed
The current duplicate call detection only checks the immediately preceding call. Models can sometimes enter cycles involving multiple different tool calls (e.g., A -> B -> A -> B). It would be more robust to check if the current call has appeared anywhere in the history of the current generation turn.
Suggested change:
    def _is_duplicate_call(name: str, args: dict) -> bool:
        """Block if this exact call has already succeeded in the current turn."""
        if not _tool_call_history:
            return False
        key = _tool_call_key(name, args)
        return any(h_key == key and not h_failed for h_key, h_failed in _tool_call_history)
        "process data you already have, or "
        "provide your final answer now."
    )
    _record_tool_call(tool_name, arguments, failed = False)
        else:
            return "Failed to fetch URL: too many redirects."

        raw_html = raw_bytes.decode("utf-8", errors = "replace")
Hardcoding utf-8 for decoding the HTML body may lead to corrupted text for pages using other encodings (e.g., ISO-8859-1). It is better to use the charset provided in the response headers.
Suggested change:
    charset = resp.info().get_content_charset() or "utf-8"
    raw_html = raw_bytes.decode(charset, errors = "replace")
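A self-contained sketch of header-driven decoding, using `email.message.Message` as a stand-in for the response headers (HTTP response headers in the standard library are a subclass of it). `decode_body` is a hypothetical helper; the fallback on an unknown charset label is an addition of this sketch, not something the suggestion above includes.

```python
from email.message import Message


def decode_body(raw_bytes: bytes, headers: Message, default: str = "utf-8") -> str:
    """Decode using the Content-Type charset, falling back to default."""
    charset = headers.get_content_charset() or default
    try:
        return raw_bytes.decode(charset, errors="replace")
    except LookupError:
        # The server advertised a charset Python does not know.
        return raw_bytes.decode(default, errors="replace")
```

For a page served as ISO-8859-1, the byte `0xE9` then decodes to "é" instead of the replacement character that a hardcoded UTF-8 decode would produce.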
    except _HTTPError:
        raise
Re-raising _HTTPError here causes the error to be caught by the caller's broad except Exception block in _web_search, resulting in a generic "Search failed" message. It would be better to catch it here, log the exception for debugging purposes, and return a more descriptive error message (including the status code and reason) so the model can understand why the fetch failed.
Suggested change:
    except _HTTPError as e:
        logger.debug(f"HTTP error fetching URL: {e}")
        return f"Failed to fetch URL: HTTP {e.code} {e.reason}"
References
- Avoid using broad, silent exception handlers like except Exception: pass. Instead, log the exception, even if at a debug level, to aid in future debugging.
    text = _re.sub(
        r"<script[^>]*>.*?</script[^>]*>",
        "",
        raw_html,
        flags = _re.DOTALL | _re.IGNORECASE,
    )
    text = _re.sub(
        r"<style[^>]*>.*?</style[^>]*>", "", text, flags = _re.DOTALL | _re.IGNORECASE
    )
    text = _re.sub(r"<[^>]+>", " ", text)
    text = _re.sub(r"\s+", " ", text).strip()
…e-record
- DNS rebinding: resolve hostname once via getaddrinfo, pin the
returned IP, rewrite the URL to connect to the pinned IP with
a Host header. Each redirect hop re-resolves and re-validates.
Closes the TOCTOU window between validation and connection.
- Charset: use resp.headers.get_content_charset() instead of
hardcoding utf-8, so pages with other encodings decode correctly.
- HTTPError: return descriptive "HTTP {code} {reason}" instead of
re-raising into a generic "Search failed" message.
- Dedup: remove redundant _record_tool_call in the duplicate branch;
the single call at the end of the loop handles all cases.
/gemini review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d41b483688
    ip_netloc = f"{pinned_ip}:{cp.port}" if cp.port else pinned_ip
    pinned_url = urlunparse(cp._replace(netloc = ip_netloc))
Keep hostname in URL when pinning fetch target IP
Rewriting current_url to use pinned_ip in the URL netloc breaks many valid fetches: for HTTPS, urllib's SNI/certificate hostname checks use the URL host (the IP), not the Host header, so domain certificates no longer match; and if pinned_ip is IPv6, the unbracketed netloc produces an invalid URL parse. This means web_search URL-fetch mode can fail for common dual-stack HTTPS sites.
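The IPv6 half of this comment can be illustrated in isolation: an IPv6 literal has to be wrapped in square brackets before a port is appended. The sketch below uses a hypothetical `pinned_netloc` helper and addresses only the URL-parsing problem; it does not solve the SNI and certificate mismatch, which needs the hostname to stay in the URL.

```python
import ipaddress
from typing import Optional


def pinned_netloc(pinned_ip: str, port: Optional[int] = None) -> str:
    """Build a URL netloc for an IP literal, bracketing IPv6 addresses."""
    ip = ipaddress.ip_address(pinned_ip)
    host = f"[{ip}]" if ip.version == 6 else str(ip)
    return f"{host}:{port}" if port else host
```

With the unbracketed form, `2001:db8::1:443` is ambiguous between an address and an address plus port, which is why URL parsers reject or misread it.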
Code Review
This pull request enhances the tool-calling framework by increasing the maximum iteration cap to 25 and implementing duplicate call detection to prevent infinite loops. It introduces a direct URL fetching feature to the web search tool, accompanied by SSRF protections such as DNS pinning and private IP validation. The update also includes logic to strip leaked tool-call XML from assistant responses, adds system prompt nudges to guide model behavior, and provides benchmark results showing improved performance. I have no feedback to provide.
…#4700)

* studio: improve GGUF tool calling accuracy and reliability
- Add URL fetching to web_search tool so models can read full page content instead of only getting search snippets. Uses html2text for clean markdown conversion with regex fallback.
- Inject current date and behavioral guidance (URL fetch workflow, no repeated queries, use code for data processing) into the tool-use system prompt.
- Append error recovery nudge to tool results that indicate failure, helping small models avoid looping on the same broken call.
- Strip leaked <tool_call> XML from assistant messages in conversation history and from the outgoing SSE stream.
- Raise default max tool iterations from 10 to 25 across backend, model schema, and frontend defaults.
- Increase _MAX_PAGE_CHARS from 4k to 16k so fetched pages contain enough content for the model to extract useful information.
- Add "IMPORTANT: These are only short snippets" hint to search results so models know to fetch full pages when needed.
Tested with Qwen3.5-4B-GGUF (UD-Q4_K_XL), 10 runs before/after:
- XML leaks in responses: 10/10 -> 0/10
- URL fetch usage: 0 -> 4/10 runs
- Runs producing actual correct answers: 0/10 -> 2/10
- Average tool calls per query: 5.5 -> 3.8 (more efficient)
- Average response time: 12.3s -> 9.8s

* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci

* Add tool calling benchmark results across model sizes and quants
Tested 16 configurations (4 models x 2 quants x 2 KV cache types) with 10 runs each on NVIDIA B200. Best config: 27B UD-Q4_K_XL + bf16 KV -- 6/10 runs found all 4 correct songs, 0 XML leaks, 131s average response time.

* Add duplicate tool-call detection and final-answer synthesis
When the model repeats the exact same tool call (same name + arguments) twice in a row, skip execution and return a redirect message telling it to try a different approach. This prevents the 8x-repeated-query loops observed on 27B and 35B models.
When the tool iteration cap (25) is reached, inject a "provide your final answer now" message before the final streaming pass. This lets the model synthesize a useful answer from everything it gathered instead of being silently cut off.
Tested on Qwen3.5-27B UD-Q4_K_XL (10 runs):
- Repeated query runs: 4/10 -> 2/10
- Cap hits: 1/10 -> 0/10
- All 4/4 accuracy: 5/10 -> 7/10

* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci

* Fix CodeQL alert: handle whitespace in script/style closing tags
The regex fallback for HTML stripping did not match closing tags with whitespace before the angle bracket (e.g. </script >). Use \s* before > in both script and style patterns.

* Address reviewer findings: SSRF, timeout crash, XML regex, dedup
- SSRF: resolve hostname via getaddrinfo and reject private, loopback, link-local, multicast, and reserved addresses before fetching
- Timeout: handle timeout=None (unlimited mode) in URL fetch path by defaulting to 60s instead of crashing on min(None, 60)
- Download cap: read at most max_chars*4+1 bytes instead of the full response body before truncating
- XML regex: match both <tool_call> and <function=...> markup in the history/stream cleanup (inference.py)
- CodeQL: use [^>]* in closing script/style tags to handle any whitespace or attributes before >
- Dedup: track whether each tool call failed so retries after transient errors are allowed; only block consecutive identical calls that both succeeded
- Final-answer synthesis: guard on max_tool_iterations > 0 so callers who disable tools do not get a false "used all calls" turn

* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci

* Fix redirect SSRF, SSE streaming regression, dedup off-by-one
- SSRF redirect bypass: disable auto-redirect in urllib, manually follow up to 5 hops with host validation at each step. Prevents public URLs from redirecting to loopback/private targets.
- SSE streaming: track prev_text on the raw cumulative and strip XML from the delta only, so completed tool_call tags do not cause the cumulative to shrink and drop trailing real text.
- Dedup off-by-one: check the immediately previous call (window=1) instead of requiring 2 matching history entries, so the second identical successful call is blocked rather than the third.

* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci

* Fix redirect HTTPError handling and tighten error prefixes
- Redirect fix: urllib raises HTTPError (not a normal response) when the redirect handler returns None. Catch HTTPError for 3xx codes and extract the Location header from the exception object.
- Error prefixes: remove overly broad "No " prefix that matched "No results found." (a valid empty-search outcome, not an error). Replace with specific prefixes like "Blocked:", "No query provided", "Failed to resolve". This ensures empty search results are correctly classified as non-errors for duplicate-call tracking.

* Fix SSE cross-chunk XML leaks, cleanup review findings
- SSE streaming: sanitize the full cumulative text before diffing against the previous sanitized snapshot, so XML tags that span chunk boundaries are stripped correctly. The previous delta-based approach leaked split tags.
- DRAINING fallback: use _strip_tool_markup() helper instead of a manual regex that only handled <tool_call> but not <function=...>.
- Move hashlib import, _TOOL_XML_RE compile, and datetime import to module level per style guide.
- Remove unused _hit_tool_cap variable.

* Fix DNS rebinding, charset detection, HTTPError handling, dedup double-record
- DNS rebinding: resolve hostname once via getaddrinfo, pin the returned IP, rewrite the URL to connect to the pinned IP with a Host header. Each redirect hop re-resolves and re-validates. Closes the TOCTOU window between validation and connection.
- Charset: use resp.headers.get_content_charset() instead of hardcoding utf-8, so pages with other encodings decode correctly.
- HTTPError: return descriptive "HTTP {code} {reason}" instead of re-raising into a generic "Search failed" message.
- Dedup: remove redundant _record_tool_call in the duplicate branch; the single call at the end of the loop handles all cases.

* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci

---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Summary
Improves tool calling for GGUF models in Unsloth Studio, particularly for smaller models (4B) that struggle with multi-step agentic workflows. The core changes add URL fetching support to web_search, inject better behavioral guidance into the system prompt, and clean up XML artifacts that leak into responses.
Changes
URL fetching in web_search (tools.py)
- Add url parameter to the web_search tool so models can fetch full page content from URLs found in search results, instead of being limited to short snippets
- Uses html2text for clean HTML-to-markdown conversion (headings, links, lists preserved), with regex-based fallback if html2text is not installed
- Increase _MAX_PAGE_CHARS from 4,000 to 16,000 so fetched pages contain enough context for the model to extract structured data
- … url parameter
System prompt nudge (inference.py)
Error recovery nudge (llama_cpp.py)
XML cleanup (inference.py, llama_cpp.py)
- Strip <tool_call>...</tool_call> XML from assistant messages in conversation history before sending to the model
Max tool iterations (inference.py, llama_cpp.py, models/inference.py, frontend store)
Test Results
Tested with unsloth/Qwen3.5-4B-GGUF (UD-Q4_K_XL), web search + code execution + thinking enabled. Prompt: "List and categorize all the songs that charted #3 on the Billboard Hot 100 in 2015." 10 runs per configuration.
When the model does use URL fetching, it works well -- the best run correctly identified all 4 songs that peaked at #3 (Love Me like You Do, Earned It, Watch Me, Drag Me Down) by fetching and parsing the full Wikipedia table.
The remaining accuracy gap is a fundamental small-model limitation: the 4B model often generates "let me fetch that page" as text output rather than actually emitting a tool call. Larger models (9B+) should see higher accuracy with the same infrastructure.
Files Changed
- studio/backend/core/inference/tools.py
- studio/backend/routes/inference.py
- studio/backend/core/inference/llama_cpp.py
- studio/backend/models/inference.py
- studio/frontend/.../chat-runtime-store.ts