Skip to content

studio: improve GGUF tool calling accuracy and reliability#4700

Merged
danielhanchen merged 15 commits into
mainfrom
studio/improve-gguf-tool-calling
Mar 31, 2026
Merged

studio: improve GGUF tool calling accuracy and reliability#4700
danielhanchen merged 15 commits into
mainfrom
studio/improve-gguf-tool-calling

Conversation

@danielhanchen

Copy link
Copy Markdown
Member

Summary

Improves tool calling for GGUF models in Unsloth Studio, particularly for smaller models (4B) that struggle with multi-step agentic workflows. The core changes add URL fetching support to web_search, inject better behavioral guidance into the system prompt, and clean up XML artifacts that leak into responses.

Changes

URL fetching in web_search (tools.py)

  • Added a url parameter to the web_search tool so models can fetch full page content from URLs found in search results, instead of being limited to short snippets
  • Uses html2text for clean HTML-to-markdown conversion (headings, links, lists preserved), with regex-based fallback if html2text is not installed
  • Increased _MAX_PAGE_CHARS from 4,000 to 16,000 so fetched pages contain enough context for the model to extract structured data
  • Search results now include an explicit hint that snippets are truncated and full content is available via the url parameter

System prompt nudge (inference.py)

  • Injects the current date so models can judge recency of search results
  • Adds explicit guidance: fetch URLs from search results for full content, do not repeat the same query, use code execution for data processing
  • Adapts the nudge based on which tools are enabled (web only, code only, or both)

Error recovery nudge (llama_cpp.py)

  • When a tool result starts with an error prefix (e.g. "Error", "Search failed", "Blocked"), appends a recovery hint telling the model to try a different approach
  • Prevents small models from looping on the same broken tool call

XML cleanup (inference.py, llama_cpp.py)

  • Strips <tool_call>...</tool_call> XML from assistant messages in conversation history before sending to the model
  • Strips the same XML from the outgoing SSE stream so users never see raw tool-call markup

Max tool iterations (inference.py, llama_cpp.py, models/inference.py, frontend store)

  • Raised default from 10 to 25 across backend defaults, Pydantic model schema, and frontend TypeScript store

Test Results

Tested with unsloth/Qwen3.5-4B-GGUF (UD-Q4_K_XL), web search + code execution + thinking enabled. Prompt: "List and categorize all the songs that charted #3 on the Billboard Hot 100 in 2015." 10 runs per configuration.

Metric Before After
XML leaks in response 10/10 0/10
URL fetches used 0 4/10 runs
Runs with correct song names 0/10 2/10
Avg tool calls 5.5 3.8
Avg response time 12.3s 9.8s

When the model does use URL fetching, it works well -- the best run correctly identified all 4 songs that peaked at #3 (Love Me like You Do, Earned It, Watch Me, Drag Me Down) by fetching and parsing the full Wikipedia table.

The remaining accuracy gap is a fundamental small-model limitation: the 4B model often generates "let me fetch that page" as text output rather than actually emitting a tool call. Larger models (9B+) should see higher accuracy with the same infrastructure.

Files Changed

File Changes
studio/backend/core/inference/tools.py URL fetch support, html2text, _MAX_PAGE_CHARS 4k->16k, snippet hint
studio/backend/routes/inference.py System prompt nudge (date + behavioral rules), XML stripping, max iterations 10->25
studio/backend/core/inference/llama_cpp.py Error recovery nudge, XML stripping in agentic loop, max iterations 10->25
studio/backend/models/inference.py Default max_tool_calls_per_message 10->25
studio/frontend/.../chat-runtime-store.ts Frontend default max tool calls 10->25

- Add URL fetching to web_search tool so models can read full page
  content instead of only getting search snippets. Uses html2text for
  clean markdown conversion with regex fallback.
- Inject current date and behavioral guidance (URL fetch workflow,
  no repeated queries, use code for data processing) into the
  tool-use system prompt.
- Append error recovery nudge to tool results that indicate failure,
  helping small models avoid looping on the same broken call.
- Strip leaked <tool_call> XML from assistant messages in conversation
  history and from the outgoing SSE stream.
- Raise default max tool iterations from 10 to 25 across backend,
  model schema, and frontend defaults.
- Increase _MAX_PAGE_CHARS from 4k to 16k so fetched pages contain
  enough content for the model to extract useful information.
- Add "IMPORTANT: These are only short snippets" hint to search
  results so models know to fetch full pages when needed.

Tested with Qwen3.5-4B-GGUF (UD-Q4_K_XL), 10 runs before/after:
- XML leaks in responses: 10/10 -> 0/10
- URL fetch usage: 0 -> 4/10 runs
- Runs producing actual correct answers: 0/10 -> 2/10
- Average tool calls per query: 5.5 -> 3.8 (more efficient)
- Average response time: 12.3s -> 9.8s
Comment thread studio/backend/core/inference/tools.py Fixed

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 68546d7aaf

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +170 to +172
parsed = urlparse(url)
if parsed.scheme not in ("http", "https"):
return f"Blocked: only http/https URLs are allowed (got {parsed.scheme!r})."

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Block internal hosts in URL fetcher

_fetch_page_text only validates the URL scheme, so web_search can fetch http(s) targets on loopback, link-local, or private networks (for example 127.0.0.1, localhost, 169.254.169.254, RFC1918 ranges). Because tool arguments come from model output, this introduces an SSRF path that can exfiltrate internal metadata/services in production deployments; host/IP allowlisting or private-range blocking is needed in addition to scheme checks.

Useful? React with 👍 / 👎.

Comment thread studio/backend/core/inference/tools.py Outdated
"""
# Direct URL fetch mode
if url and url.strip():
return _fetch_page_text(url.strip(), timeout = min(timeout, 60))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Handle unlimited timeout in URL mode

When users set tool_call_timeout=9999 (documented as no limit), the caller passes timeout=None; this branch then executes min(timeout, 60), which raises TypeError for URL fetches before any network call. That means web_search with a url argument crashes the tool path under the documented unlimited-timeout setting.

Useful? React with 👍 / 👎.

Comment thread studio/backend/core/inference/tools.py Outdated
Comment on lines +181 to +182
with urllib.request.urlopen(req, timeout = timeout) as resp:
raw_html = resp.read().decode("utf-8", errors = "replace")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Enforce response size before reading fetched pages

The function truncates text to _MAX_PAGE_CHARS only after resp.read() has already loaded the entire body into memory. Large or binary URLs can therefore consume excessive memory/CPU and destabilize inference workers despite the apparent page-size cap; read limits should be applied during download, not post-hoc.

Useful? React with 👍 / 👎.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request enhances the tool-use framework by adding a direct URL fetching capability to the web_search tool, increasing the default maximum tool iterations from 10 to 25, and implementing system prompt nudges to improve model performance. It also includes logic to sanitize conversation history and output streams by stripping leaked tool-call XML tags. Feedback was provided regarding the consistency of timeout values in the fetching utility and the placement of imports according to PEP 8 guidelines.

_MAX_PAGE_CHARS = 16000 # limit fetched page text


def _fetch_page_text(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The _fetch_page_text function has a default timeout of 30 seconds. However, when called from _web_search, the timeout is explicitly set to min(timeout, 60). This means the effective timeout for _fetch_page_text will always be capped at 60 seconds, and its own default of 30 seconds will only apply if it's called directly, not through _web_search.

To improve clarity and avoid potential confusion, consider making the default timeout for _fetch_page_text consistent with its intended maximum usage (e.g., 60 seconds if that's the hard limit for web fetches) or add a constant for this maximum timeout.

Comment thread studio/backend/routes/inference.py Outdated
_has_web = "web_search" in _tool_names
_has_code = "python" in _tool_names or "terminal" in _tool_names

from datetime import date as _date

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The import statement from datetime import date as _date is placed inside the openai_chat_completions function. It is generally considered best practice to place all standard library imports at the top of the file, outside of any function definitions. This improves readability and ensures that dependencies are declared clearly at the module level.

References
  1. PEP 8: Imports are always put at the top of the file, just after any module comments and docstrings, and before module globals and constants. (link)

Tested 16 configurations (4 models x 2 quants x 2 KV cache types)
with 10 runs each on NVIDIA B200.

Best config: 27B UD-Q4_K_XL + bf16 KV -- 6/10 runs found all 4
correct songs, 0 XML leaks, 131s average response time.
@danielhanchen

Copy link
Copy Markdown
Member Author

Benchmark Results: Full Cartesian Grid

Tested 16 configurations across 4 model sizes, 2 quantizations, and 2 KV cache types. 10 runs per config, NVIDIA B200.

Prompt: "List and categorize all the songs that charted #3 on the Billboard Hot 100 in 2015."

Ground truth: 4 songs peaked at #3 -- "Love Me like You Do", "Earned It", "Watch Me", "Drag Me Down".

Model Quant KV Cache OK/10 Avg Time Tools XML Leaks URL Fetch Peak3 Avg All 4/4 Best Songs
4B UD-Q4_K_XL f16 10/10 9.8s 3.5 0/10 4/10 0.8/4 2/10 9
4B UD-Q4_K_XL bf16 10/10 10.6s 4.5 0/10 4/10 0.4/4 1/10 5
4B Q8_0 f16 10/10 4.9s 2.4 0/10 8/10 0.4/4 1/10 5
4B Q8_0 bf16 10/10 8.0s 3.0 0/10 5/10 0.0/4 0/10 0
9B UD-Q4_K_XL f16 10/10 6.7s 2.0 0/10 5/10 0.0/4 0/10 3
9B UD-Q4_K_XL bf16 9/10 49.5s 2.4 0/10 5/10 0.0/4 0/10 1
9B Q8_0 f16 10/10 7.4s 2.5 0/10 5/10 0.0/4 0/10 2
9B Q8_0 bf16 10/10 10.4s 2.7 0/10 6/10 1.0/4 2/10 15
27B UD-Q4_K_XL bf16 9/10 131.1s 13.8 0/10 7/10 2.7/4 6/10 27
27B UD-Q4_K_XL f16 7/10 201.6s 14.1 0/10 8/10 2.0/4 5/10 26
27B Q8_0 f16 4/10 312.5s 16.0 1/10 10/10 2.4/4 6/10 28
27B Q8_0 bf16 5/10 258.4s 16.5 2/10 10/10 0.9/4 1/10 27
35B-A3B UD-Q4_K_XL f16 3/10 353.6s 14.7 1/10 6/10 1.2/4 3/10 27
35B-A3B UD-Q4_K_XL bf16 3/10 356.2s 17.2 1/10 8/10 1.6/4 4/10 27
35B-A3B Q8_0 f16 2/10 372.1s 17.6 1/10 7/10 1.2/4 3/10 26
35B-A3B Q8_0 bf16 6/10 267.7s 17.5 1/10 8/10 2.4/4 6/10 27

Key Takeaways

  1. 27B UD-Q4_K_XL + bf16 KV is the sweet spot -- 6/10 all-4/4 accuracy, 0 XML leaks, 131s average, 9/10 reliability.

  2. Larger models use tools far more effectively. 27B/35B averaged 14-17 tool calls per query with multiple URL fetches, while 4B/9B averaged only 2-4 tool calls.

  3. XML leaks are eliminated for small models (0/10) and near-zero for large models (0-2/10).

  4. 9B was surprisingly weaker than 4B on this task. It rarely extracted data from fetched pages. May need different prompting for complex multi-step retrieval.

  5. 35B-A3B had reliability issues -- most runs timed out due to slow per-token generation across many agentic iterations. When it completed, accuracy matched 27B.

  6. bf16 KV cache helped 27B significantly -- 131s vs 202s, and 6/10 vs 5/10 accuracy. For other sizes the effect was mixed.

Full benchmark results are in studio/backend/tests/tool_calling_benchmark_results.md.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8f60cea2f5

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread studio/backend/routes/inference.py Outdated
Comment on lines 1215 to 1217
cumulative = _TOOL_XML_RE.sub("", cumulative)
new_text = cumulative[len(prev_text) :]
prev_text = cumulative

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Handle shrinking cumulative text after XML redaction

gguf_tool_stream computes deltas by slicing cumulative[len(prev_text):], but cumulative is now rewritten by _TOOL_XML_RE.sub(...) first. When a later chunk closes a previously streamed <tool_call>...</tool_call> span, the redacted cumulative text becomes shorter than prev_text, so the slice offset is wrong and subsequent user-visible text is dropped or garbled instead of streamed correctly. This shows up precisely in the XML-leak scenario this block is trying to recover from, and can also affect legitimate outputs that include that tag pattern.

Useful? React with 👍 / 👎.

danielhanchen and others added 2 commits March 31, 2026 07:53
When the model repeats the exact same tool call (same name + arguments)
twice in a row, skip execution and return a redirect message telling it
to try a different approach. This prevents the 8x-repeated-query loops
observed on 27B and 35B models.

When the tool iteration cap (25) is reached, inject a "provide your
final answer now" message before the final streaming pass. This lets
the model synthesize a useful answer from everything it gathered
instead of being silently cut off.

Tested on Qwen3.5-27B UD-Q4_K_XL (10 runs):
- Repeated query runs: 4/10 -> 2/10
- Cap hits: 1/10 -> 0/10
- All 4/4 accuracy: 5/10 -> 7/10
@chatgpt-codex-connector

Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Repo admins can enable using credits for code reviews in their settings.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cf9deae02c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +2747 to +2751
conversation.append(
{
"role": "user",
"content": (
"You have used all available tool calls. Based on "

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Gate tool-cap nudge on actual cap exhaustion

This block always appends a synthetic user message after the loop, even when max_tool_iterations is 0 (documented as “disabled” via max_tool_calls_per_message). In that case no tool call happened, but the model still receives “You have used all available tool calls…”, which injects false context and can alter the final answer path for users intentionally disabling tools. Append this nudge only when the loop actually consumed the configured cap.

Useful? React with 👍 / 👎.

The regex fallback for HTML stripping did not match closing tags
with whitespace before the angle bracket (e.g. </script >).
Use \s* before > in both script and style patterns.
Comment thread studio/backend/core/inference/tools.py Fixed
danielhanchen and others added 2 commits March 31, 2026 08:20
- SSRF: resolve hostname via getaddrinfo and reject private, loopback,
  link-local, multicast, and reserved addresses before fetching
- Timeout: handle timeout=None (unlimited mode) in URL fetch path
  by defaulting to 60s instead of crashing on min(None, 60)
- Download cap: read at most max_chars*4+1 bytes instead of the
  full response body before truncating
- XML regex: match both <tool_call> and <function=...> markup in
  the history/stream cleanup (inference.py)
- CodeQL: use [^>]* in closing script/style tags to handle any
  whitespace or attributes before >
- Dedup: track whether each tool call failed so retries after
  transient errors are allowed; only block consecutive identical
  calls that both succeeded
- Final-answer synthesis: guard on max_tool_iterations > 0 so
  callers who disable tools do not get a false "used all calls" turn
@chatgpt-codex-connector

Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Repo admins can enable using credits for code reviews in their settings.

danielhanchen and others added 3 commits March 31, 2026 01:22
- SSRF redirect bypass: disable auto-redirect in urllib, manually
  follow up to 5 hops with host validation at each step. Prevents
  public URLs from redirecting to loopback/private targets.
- SSE streaming: track prev_text on the raw cumulative and strip
  XML from the delta only, so completed tool_call tags do not cause
  the cumulative to shrink and drop trailing real text.
- Dedup off-by-one: check the immediately previous call (window=1)
  instead of requiring 2 matching history entries, so the second
  identical successful call is blocked rather than the third.
@chatgpt-codex-connector

Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Repo admins can enable using credits for code reviews in their settings.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fec5f2cb00

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +215 to +216
def redirect_request(self, req, fp, code, msg, headers, newurl):
return None

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Handle HTTP redirects without triggering urllib errors

Returning None from _NoRedirect.redirect_request causes urllib to treat 3xx responses as HTTPError via the default error handler, so _fetch_page_text never reaches the if resp.status in (301, ...) redirect branch. In practice, URLs that rely on normal redirects (for example httphttps or canonical-path redirects) will fail with Failed to fetch URL instead of being followed through the intended validated-hop loop.

Useful? React with 👍 / 👎.

Comment thread studio/backend/core/inference/tools.py Outdated
current_url,
headers = {"User-Agent": "UnslothStudio/1.0"},
)
resp = opener.open(req, timeout = timeout)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Prevent DNS rebinding between host check and fetch

The private-range check is done on a preflight DNS lookup, but the subsequent opener.open(req, ...) call performs a fresh resolution when connecting. That leaves a DNS-rebinding window where an attacker-controlled hostname can resolve to a public IP during validation and then to a private/loopback IP during the actual request, bypassing the SSRF protection added here.

Useful? React with 👍 / 👎.

- Redirect fix: urllib raises HTTPError (not a normal response) when
  the redirect handler returns None. Catch HTTPError for 3xx codes
  and extract the Location header from the exception object.
- Error prefixes: remove overly broad "No " prefix that matched
  "No results found." (a valid empty-search outcome, not an error).
  Replace with specific prefixes like "Blocked:", "No query provided",
  "Failed to resolve". This ensures empty search results are correctly
  classified as non-errors for duplicate-call tracking.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c80d30f4fc

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +1143 to +1147
for _msg in gguf_messages:
if _msg.get("role") == "assistant" and isinstance(
_msg.get("content"), str
):
_msg["content"] = _TOOL_XML_RE.sub("", _msg["content"]).strip()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Respect auto-heal setting when stripping tool-call XML

This sanitization runs unconditionally in the tool path, so even when auto_heal_tool_calls is false the server still removes <tool_call>...</tool_call> / <function=...>...</function> content from assistant messages (and the same regex is later applied to streamed deltas). That breaks the advertised opt-out behavior and can silently corrupt legitimate outputs that include those tags as literal text, while also altering conversation history before the next model step.

Useful? React with 👍 / 👎.

@danielhanchen

Copy link
Copy Markdown
Member Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request enhances tool-calling capabilities by increasing the iteration limit, adding duplicate call detection, and extending the web search tool to support direct URL fetching with SSRF protection. It also introduces system prompt nudges to guide model behavior, implements logic to strip leaked tool-call XML, and includes comprehensive benchmark results for GGUF tool calling. Review feedback identifies several optimization opportunities, including moving imports and regex compilations to the module level, removing an unused variable, and refactoring redundant logic to use existing helper functions.

# where the model repeats the exact same call. Retries after
# a transient failure are allowed (only block when the previous
# identical call succeeded).
import hashlib as _hl

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Importing hashlib inside the function on every call is inefficient. This import should be moved to the top of the file to follow standard Python practices and improve performance.

References
  1. Imports should be at the top of the file, after any module comments and docstrings, and before module globals and constants. (link)

key = _tool_call_key(name, args)
_tool_call_history.append((key, failed))

_hit_tool_cap = False

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The variable _hit_tool_cap is initialized but never used within the function. It should be removed to maintain code cleanliness.

Comment on lines +2601 to +2606
content_accum = re.sub(
r"<tool_call>.*?</tool_call>",
"",
content_accum,
flags = re.DOTALL,
).strip()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This manual regex substitution is redundant and less comprehensive than the _strip_tool_markup helper function defined earlier (line 2160). Using the helper ensures that both \u003ctool_call\u003e and \u003cfunction=...\u003e tags are handled consistently, including cases where tags might be unclosed at the end of the stream. Additionally, _strip_tool_markup respects the auto_heal_tool_calls setting.

                            content_accum = _strip_tool_markup(content_accum, final = True)

Comment thread studio/backend/routes/inference.py Outdated
_has_web = "web_search" in _tool_names
_has_code = "python" in _tool_names or "terminal" in _tool_names

from datetime import date as _date

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Importing date inside the request handler on every call is inefficient. This import should be moved to the top of the file.

References
  1. Imports should be at the top of the file, after any module comments and docstrings, and before module globals and constants. (link)

Comment thread studio/backend/routes/inference.py Outdated
Comment on lines +1139 to +1142
_TOOL_XML_RE = _re.compile(
r"<tool_call>.*?</tool_call>|<function=\w+>.*?</function>",
_re.DOTALL,
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Compiling the _TOOL_XML_RE regular expression inside the request handler on every call is inefficient. It should be defined as a module-level constant to improve performance, especially given that this regex is also used within the streaming generator.

- SSE streaming: sanitize the full cumulative text before diffing
  against the previous sanitized snapshot, so XML tags that span
  chunk boundaries are stripped correctly. The previous delta-based
  approach leaked split tags.
- DRAINING fallback: use _strip_tool_markup() helper instead of a
  manual regex that only handled <tool_call> but not <function=...>.
- Move hashlib import, _TOOL_XML_RE compile, and datetime import to
  module level per style guide.
- Remove unused _hit_tool_cap variable.
@danielhanchen

Copy link
Copy Markdown
Member Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request enhances tool-calling capabilities by increasing the maximum tool iteration limit to 25, implementing duplicate tool-call detection using hashing, and adding a direct URL fetching feature to the web search tool with SSRF protection. It also introduces system prompt nudges to guide the model's tool usage, adds logic to sanitize leaked tool-call XML from the output, and includes comprehensive benchmark results. Feedback focuses on addressing a potential DNS rebinding vulnerability in the SSRF logic, improving the robustness of duplicate call detection to cover cycles, handling character encoding and HTTP errors more gracefully during page fetches, and refining the HTML-to-text conversion fallback.

Comment thread studio/backend/core/inference/tools.py Outdated
Comment on lines +202 to +231
ok, reason = _is_public_host(
parsed.hostname,
parsed.port or (443 if parsed.scheme == "https" else 80),
)
if not ok:
return reason

try:
import urllib.request
from urllib.error import HTTPError as _HTTPError
from urllib.parse import urljoin

# Disable auto-redirect so we can validate each hop for SSRF.
# urllib raises HTTPError for 3xx when the handler returns None,
# so we catch that and extract the Location header manually.
class _NoRedirect(urllib.request.HTTPRedirectHandler):
def redirect_request(self, req, fp, code, msg, headers, newurl):
return None

opener = urllib.request.build_opener(_NoRedirect)
max_bytes = max_chars * 4 + 1
current_url = url

for _hop in range(5):
req = urllib.request.Request(
current_url,
headers = {"User-Agent": "UnslothStudio/1.0"},
)
try:
resp = opener.open(req, timeout = timeout)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-high high

The SSRF protection here is vulnerable to a DNS rebinding attack. _is_public_host resolves the hostname to validate the IP, but opener.open(req) performs its own resolution. A malicious actor could use a DNS server that returns a safe IP for the first resolution and a private/internal IP for the second. To mitigate this, you should resolve the hostname once, validate the IP, and then use that IP address in the request URL while setting the Host header manually. When resolving the hostname, ensure you use socket.getaddrinfo() to support both IPv4 and IPv6 addresses.

References
  1. Use socket.getaddrinfo() to create sockets that support both IPv4 and IPv6 addresses, instead of hardcoding an address family like socket.AF_INET.

Comment on lines +2188 to +2194
def _is_duplicate_call(name: str, args: dict) -> bool:
"""Block if the immediately previous call was identical and succeeded."""
if not _tool_call_history:
return False
key = _tool_call_key(name, args)
last_key, last_failed = _tool_call_history[-1]
return last_key == key and not last_failed

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The current duplicate call detection only checks the immediately preceding call. Models can sometimes enter cycles involving multiple different tool calls (e.g., A -> B -> A -> B). It would be more robust to check if the current call has appeared anywhere in the history of the current generation turn.

Suggested change
def _is_duplicate_call(name: str, args: dict) -> bool:
"""Block if the immediately previous call was identical and succeeded."""
if not _tool_call_history:
return False
key = _tool_call_key(name, args)
last_key, last_failed = _tool_call_history[-1]
return last_key == key and not last_failed
def _is_duplicate_call(name: str, args: dict) -> bool:
"""Block if this exact call has already succeeded in the current turn."""
if not _tool_call_history:
return False
key = _tool_call_key(name, args)
return any(h_key == key and not h_failed for h_key, h_failed in _tool_call_history)

"process data you already have, or "
"provide your final answer now."
)
_record_tool_call(tool_name, arguments, failed = False)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This call to _record_tool_call is redundant because the tool call is recorded again at the end of the loop (line 2738) regardless of whether it was a duplicate or a fresh execution. Removing this redundant call keeps the history clean and avoids double-counting attempts.

Comment thread studio/backend/core/inference/tools.py Outdated
else:
return "Failed to fetch URL: too many redirects."

raw_html = raw_bytes.decode("utf-8", errors = "replace")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Hardcoding utf-8 for decoding the HTML body may lead to corrupted text for pages using other encodings (e.g., ISO-8859-1). It is better to use the charset provided in the response headers.

Suggested change
raw_html = raw_bytes.decode("utf-8", errors = "replace")
charset = resp.info().get_content_charset() or "utf-8"
raw_html = raw_bytes.decode(charset, errors = "replace")

Comment thread studio/backend/core/inference/tools.py Outdated
Comment on lines +256 to +257
except _HTTPError:
raise

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Re-raising _HTTPError here causes the error to be caught by the caller's broad except Exception block in _web_search, resulting in a generic "Search failed" message. It would be better to catch it here, log the exception for debugging purposes, and return a more descriptive error message (including the status code and reason) so the model can understand why the fetch failed.

Suggested change
except _HTTPError:
raise
except _HTTPError as e:
logger.debug(f"HTTP error fetching URL: {e}")
return f"Failed to fetch URL: HTTP {e.code} {e.reason}"
References
  1. Avoid using broad, silent exception handlers like except Exception: pass. Instead, log the exception, even if at a debug level, to aid in future debugging.

Comment on lines +272 to +282
text = _re.sub(
r"<script[^>]*>.*?</script[^>]*>",
"",
raw_html,
flags = _re.DOTALL | _re.IGNORECASE,
)
text = _re.sub(
r"<style[^>]*>.*?</style[^>]*>", "", text, flags = _re.DOTALL | _re.IGNORECASE
)
text = _re.sub(r"<[^>]+>", " ", text)
text = _re.sub(r"\s+", " ", text).strip()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using regular expressions to strip HTML is fragile and can be easily bypassed or produce poor results with complex or malformed HTML. While this is a fallback, consider making html2text a required dependency or using a more robust library like beautifulsoup4 if available in the environment.

danielhanchen and others added 2 commits March 31, 2026 09:52
…e-record

- DNS rebinding: resolve hostname once via getaddrinfo, pin the
  returned IP, rewrite the URL to connect to the pinned IP with
  a Host header. Each redirect hop re-resolves and re-validates.
  Closes the TOCTOU window between validation and connection.
- Charset: use resp.headers.get_content_charset() instead of
  hardcoding utf-8, so pages with other encodings decode correctly.
- HTTPError: return descriptive "HTTP {code} {reason}" instead of
  re-raising into a generic "Search failed" message.
- Dedup: remove redundant _record_tool_call in the duplicate branch;
  the single call at the end of the loop handles all cases.
@chatgpt-codex-connector

Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Repo admins can enable using credits for code reviews in their settings.

@danielhanchen

Copy link
Copy Markdown
Member Author

/gemini review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d41b483688

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +239 to +240
ip_netloc = f"{pinned_ip}:{cp.port}" if cp.port else pinned_ip
pinned_url = urlunparse(cp._replace(netloc = ip_netloc))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Keep hostname in URL when pinning fetch target IP

Rewriting current_url to use pinned_ip in the URL netloc breaks many valid fetches: for HTTPS, urllib's SNI/certificate hostname checks use the URL host (the IP), not the Host header, so domain certificates no longer match; and if pinned_ip is IPv6, the unbracketed netloc produces an invalid URL parse. This means web_search URL-fetch mode can fail for common dual-stack HTTPS sites.

Useful? React with 👍 / 👎.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request enhances the tool-calling framework by increasing the maximum iteration cap to 25 and implementing duplicate call detection to prevent infinite loops. It introduces a direct URL fetching feature to the web search tool, accompanied by SSRF protections such as DNS pinning and private IP validation. The update also includes logic to strip leaked tool-call XML from assistant responses, adds system prompt nudges to guide model behavior, and provides benchmark results showing improved performance. I have no feedback to provide.

@danielhanchen danielhanchen merged commit e159b93 into main Mar 31, 2026
5 checks passed
@danielhanchen danielhanchen deleted the studio/improve-gguf-tool-calling branch March 31, 2026 10:06
shibizhao pushed a commit to shibizhao/unsloth-npu that referenced this pull request Apr 7, 2026
…#4700)

* studio: improve GGUF tool calling accuracy and reliability

- Add URL fetching to web_search tool so models can read full page
  content instead of only getting search snippets. Uses html2text for
  clean markdown conversion with regex fallback.
- Inject current date and behavioral guidance (URL fetch workflow,
  no repeated queries, use code for data processing) into the
  tool-use system prompt.
- Append error recovery nudge to tool results that indicate failure,
  helping small models avoid looping on the same broken call.
- Strip leaked <tool_call> XML from assistant messages in conversation
  history and from the outgoing SSE stream.
- Raise default max tool iterations from 10 to 25 across backend,
  model schema, and frontend defaults.
- Increase _MAX_PAGE_CHARS from 4k to 16k so fetched pages contain
  enough content for the model to extract useful information.
- Add "IMPORTANT: These are only short snippets" hint to search
  results so models know to fetch full pages when needed.

Tested with Qwen3.5-4B-GGUF (UD-Q4_K_XL), 10 runs before/after:
- XML leaks in responses: 10/10 -> 0/10
- URL fetch usage: 0 -> 4/10 runs
- Runs producing actual correct answers: 0/10 -> 2/10
- Average tool calls per query: 5.5 -> 3.8 (more efficient)
- Average response time: 12.3s -> 9.8s

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add tool calling benchmark results across model sizes and quants

Tested 16 configurations (4 models x 2 quants x 2 KV cache types)
with 10 runs each on NVIDIA B200.

Best config: 27B UD-Q4_K_XL + bf16 KV -- 6/10 runs found all 4
correct songs, 0 XML leaks, 131s average response time.

* Add duplicate tool-call detection and final-answer synthesis

When the model repeats the exact same tool call (same name + arguments)
twice in a row, skip execution and return a redirect message telling it
to try a different approach. This prevents the 8x-repeated-query loops
observed on 27B and 35B models.

When the tool iteration cap (25) is reached, inject a "provide your
final answer now" message before the final streaming pass. This lets
the model synthesize a useful answer from everything it gathered
instead of being silently cut off.

Tested on Qwen3.5-27B UD-Q4_K_XL (10 runs):
- Repeated query runs: 4/10 -> 2/10
- Cap hits: 1/10 -> 0/10
- All 4/4 accuracy: 5/10 -> 7/10

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix CodeQL alert: handle whitespace in script/style closing tags

The regex fallback for HTML stripping did not match closing tags
with whitespace before the angle bracket (e.g. </script >).
Use \s* before > in both script and style patterns.

* Address reviewer findings: SSRF, timeout crash, XML regex, dedup

- SSRF: resolve hostname via getaddrinfo and reject private, loopback,
  link-local, multicast, and reserved addresses before fetching
- Timeout: handle timeout=None (unlimited mode) in URL fetch path
  by defaulting to 60s instead of crashing on min(None, 60)
- Download cap: read at most max_chars*4+1 bytes instead of the
  full response body before truncating
- XML regex: match both <tool_call> and <function=...> markup in
  the history/stream cleanup (inference.py)
- CodeQL: use [^>]* in closing script/style tags to handle any
  whitespace or attributes before >
- Dedup: track whether each tool call failed so retries after
  transient errors are allowed; only block consecutive identical
  calls that both succeeded
- Final-answer synthesis: guard on max_tool_iterations > 0 so
  callers who disable tools do not get a false "used all calls" turn

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix redirect SSRF, SSE streaming regression, dedup off-by-one

- SSRF redirect bypass: disable auto-redirect in urllib, manually
  follow up to 5 hops with host validation at each step. Prevents
  public URLs from redirecting to loopback/private targets.
- SSE streaming: track prev_text on the raw cumulative and strip
  XML from the delta only, so completed tool_call tags do not cause
  the cumulative to shrink and drop trailing real text.
- Dedup off-by-one: check the immediately previous call (window=1)
  instead of requiring 2 matching history entries, so the second
  identical successful call is blocked rather than the third.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix redirect HTTPError handling and tighten error prefixes

- Redirect fix: urllib raises HTTPError (not a normal response) when
  the redirect handler returns None. Catch HTTPError for 3xx codes
  and extract the Location header from the exception object.
- Error prefixes: remove overly broad "No " prefix that matched
  "No results found." (a valid empty-search outcome, not an error).
  Replace with specific prefixes like "Blocked:", "No query provided",
  "Failed to resolve". This ensures empty search results are correctly
  classified as non-errors for duplicate-call tracking.

* Fix SSE cross-chunk XML leaks, cleanup review findings

- SSE streaming: sanitize the full cumulative text before diffing
  against the previous sanitized snapshot, so XML tags that span
  chunk boundaries are stripped correctly. The previous delta-based
  approach leaked split tags.
- DRAINING fallback: use _strip_tool_markup() helper instead of a
  manual regex that only handled <tool_call> but not <function=...>.
- Move hashlib import, _TOOL_XML_RE compile, and datetime import to
  module level per style guide.
- Remove unused _hit_tool_cap variable.

* Fix DNS rebinding, charset detection, HTTPError handling, dedup double-record

- DNS rebinding: resolve hostname once via getaddrinfo, pin the
  returned IP, rewrite the URL to connect to the pinned IP with
  a Host header. Each redirect hop re-resolves and re-validates.
  Closes the TOCTOU window between validation and connection.
- Charset: use resp.headers.get_content_charset() instead of
  hardcoding utf-8, so pages with other encodings decode correctly.
- HTTPError: return descriptive "HTTP {code} {reason}" instead of
  re-raising into a generic "Search failed" message.
- Dedup: remove redundant _record_tool_call in the duplicate branch;
  the single call at the end of the loop handles all cases.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants