
Studio: re-introduce multi-format tool calling with parser bug fixes #5811

Draft
danielhanchen wants to merge 4 commits into main from studio-tools-multi-format-v3

Conversation

@danielhanchen

Member

What this PR does

Resubmits the work from #5615 (reverted in #5619) with the four parser / route bug fixes developed on #5620 folded in. Same 4-file scope as the original #5615 (tool_call_parser.py, routes/inference.py, two test files). The healing-parity package in #5620 -- the GGUF canonical heal key in llama_cpp.py and the safetensors_agentic re-prompt loop -- is deliberately left to #5620 so this redo stays scoped to the same files #5615 touched.

Why a redo and not just merging #5620

#5620 is open as DRAFT, holding while real-model end-to-end validation finishes. Splitting the parser bug fixes out into this PR lets the four documented #5615 regressions land sooner without waiting on the healing-parity validation that #5620 is gated on. Once both pieces are in, the studio tool-call behaviour matches what llama-server already normalises for GGUF.

Bug fixes folded in from #5620

The four parser bugs that prompted the #5615 revert are all fixed here:

  1. Where: tool_call_parser.py, Mistral closed-pair patterns.
     Bug: \[TOOL_CALLS\]...\{.*?\} is non-greedy on }, truncating nested JSON ([TOOL_CALLS]search{"filters":{"date":"2024"},"q":"foo"} leaked ,"q":"foo"}).
     Fix: _strip_mistral_closed_calls + the _balanced_brace_end / _balanced_bracket_end helpers, which ignore braces inside JSON strings.
  2. Where: routes/inference.py, _TOOL_XML_RE.
     Bug: <\|python_tag\|>[^\n<]* stops at a literal < (<|python_tag|>python.call(code="if x < 10: pass") left < 10: pass") behind).
     Fix: <\|python_tag\|>(?:[^<]|<(?!\|))* -- consumes any character that is not a Llama-3 <|sentinel|> start, so literal <, newlines, and embedded JSON stay inside the strip.
  3. Where: tool_call_parser.py, Llama-3 sentinel loop.
     Bug: the fixed-order for sentinel in (...) loop silently dropped calls when the stream had <|eot_id|><|begin_of_text|>{json}, because begin_of_text was tested before eot_id consumed its prefix.
     Fix: while True / matched loop -- the order of sentinels in the stream no longer matters.
  4. Where: tool_call_parser.py, Llama-3 KV decoder.
     Bug: bytes(s, "utf-8").decode("unicode_escape") mangles non-ASCII ("café日本" -> 'cafÃ©æ\x97¥æ\x9c¬').
     Fix: json.loads('"' + value + '"') -- handles \n / \t / \uXXXX escapes correctly while preserving literal UTF-8 bytes.
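
To make bug 1 concrete, here is a minimal sketch of a balanced-brace scan that skips braces inside JSON strings, next to the non-greedy regex it replaces. The function name `balanced_brace_end`, its signature, and its return convention are assumptions for illustration; the PR's `_balanced_brace_end` helper may be written differently.

```python
import re


def balanced_brace_end(text: str, start: int) -> int:
    """Return the index just past the `}` that closes text[start], or -1.

    Hypothetical helper: braces inside JSON string literals are ignored,
    and backslash escapes inside strings are honoured.
    """
    if start >= len(text) or text[start] != "{":
        return -1
    depth = 0
    in_string = False
    escaped = False
    for i in range(start, len(text)):
        ch = text[i]
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return i + 1
    return -1  # unbalanced or truncated input


sample = '[TOOL_CALLS]search{"filters":{"date":"2024"},"q":"foo"} done'

# The non-greedy pattern stops at the first `}` and leaks the remainder.
naive = re.sub(r"\[TOOL_CALLS\]\w+\{.*?\}", "", sample)

# The balanced scan finds the true end of the argument object.
end = balanced_brace_end(sample, sample.index("{"))
balanced = sample[: sample.index("[TOOL_CALLS]")] + sample[end:]
```

Here `naive` still contains the leaked `,"q":"foo"}` tail, while `balanced` keeps only the trailing prose.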

_TOOL_XML_RE keeps the orphan-handling clauses that #5735 added for the speculative-buffer leak shapes (closed pair OR orphan-open to EOF, bare orphan close, tail-only </parameter>). The new _strip_tool_xml(text) helper composes _TOOL_XML_RE with _strip_mistral_closed_calls at the 8 route-layer call sites so the Mistral nested-JSON shape gets balanced-brace handling everywhere.

Out of scope -- still belongs to #5620

These five concerns are NOT included here and stay reserved for #5620:

  • GGUF canonical heal key in llama_cpp.py ({"query": raw_args} -> per-tool dict lookup) -- bug 5 from the post-mortem of #5615 (studio: tool calling for Llama-3, Mistral, Gemma 4 on safetensors + MLX).
  • safetensors_agentic.py re-prompt loop (_MAX_REPROMPTS = 3) -- nudges the model when it emits intent without calling a tool.
  • TestLoopRePrompt (6 tests) -- exercises the re-prompt loop.
  • TestLoopCanonicalHealKey (3 tests) -- exercises the canonical heal key under the loop.
  • TestGGUFSafetensorsHealingParity (5 tests) -- pins shared constants between GGUF and safetensors paths.

Capability gating

_detect_safetensors_features allows templates whose tool-call format is any of the seven supported markers. The gate still suppresses supports_tools for templates that advertise tools but use a shape the parser cannot honour, so the UI never enables a pill the loop will not return.

Tests

pytest studio/backend/tests/test_safetensors_tool_loop.py studio/backend/tests/test_safetensors_capability_advertise.py -q -> 104 passed.

That number is exactly #5620's 118 passing minus the 14 dropped tests (TestLoopRePrompt 6 + TestLoopCanonicalHealKey 3 + TestGGUFSafetensorsHealingParity 5).

Coverage matrix:

  • TestParser / TestParserMultiFormat -- parser shape coverage for all seven formats (Qwen/Hermes, Llama-3 python_tag and bare JSON, Mistral pre-v11 / v11+ / Ministral, Gemma 4). These tests are what catch regressions of bugs 1, 3, 4 -- bugs 1 and 3 surface as wrong parse output on representative inputs, bug 4 surfaces as garbled non-ASCII argument values.
  • TestRoutesPythonTagStrip (8 tests) -- pins bug 2's regex on single-line, less-than-in-code, multi-line, multi-line-with-less-than, EOM sentinel stop, EOT sentinel stop, JSON-form multi-line, and "EOM then trailing python_tag" cases.
  • TestLoopBasic / TestLoopBehaviour / TestLoopControl / TestStatusFormatting / TestProseMentioningToolCall / TestChatTemplateHelper / TestGuardrails / TestGptOssNameDetection -- existing safetensors loop coverage.
  • test_safetensors_capability_advertise.py -- capability gate keeps tools enabled for Llama-3 / Mistral / Gemma 4 / Llama-3.2 bare-JSON templates while still suppressing tools for unknown emission formats.
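
The less-than-in-code and sentinel-stop cases that TestRoutesPythonTagStrip pins can be reproduced with the two patterns quoted in this PR. This is a small standalone sketch, not the actual test code; the sample text and variable names are made up for illustration.

```python
import re

# Old clause: stops at the first newline or literal `<`.
OLD_PYTHON_TAG = re.compile(r"<\|python_tag\|>[^\n<]*")

# New clause: consumes everything except the start of a `<|sentinel|>`.
NEW_PYTHON_TAG = re.compile(r"<\|python_tag\|>(?:[^<]|<(?!\|))*")

text = 'Sure.<|python_tag|>python.call(code="if x < 10: pass")<|eom_id|>'

old_out = OLD_PYTHON_TAG.sub("", text)
new_out = NEW_PYTHON_TAG.sub("", text)

# Multi-line payloads are also covered, because `[^<]` matches newlines.
multiline = '<|python_tag|>python.call(code="a = 1\nb < a")<|eot_id|>'
multiline_out = NEW_PYTHON_TAG.sub("", multiline)
```

With the old clause the fragment `< 10: pass")` survives the strip; with the new clause only the text before the marker and the trailing sentinel remain.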

Draft status

Open as draft to mirror #5620's caution. Flip to ready after the CI matrix is green and someone validates with real models (studio-backend-ci, studio-api-smoke, studio-inference-smoke, the mac / windows variants, mlx-ci, consolidated-tests-ci, lint-ci). This is the same validation gate that #5620 is currently holding on.

Refs: #5615 (reverted), #5619 (the revert), #5620 (open draft, source of the bug fixes folded in here).

danielhanchen and others added 2 commits May 27, 2026 13:15
Resubmits the work from #5615 (reverted in #5619) with the parser /
route bug fixes that were subsequently developed on #5620 folded in.
The healing-parity package in #5620 -- the GGUF canonical heal key
in `llama_cpp.py` and the safetensors_agentic re-prompt loop -- is
deliberately left to #5620 so this redo stays scoped to the same
4 files that #5615 originally touched.

Adds multi-format tool-call parsing for the safetensors / MLX agentic
loop so Llama-3, Llama-3.2 bare JSON, Mistral pre-v11 / v11+ /
Ministral, and Gemma 4 tool emissions are normalised to OpenAI shape
instead of leaking as prose, plus a route-layer strip that removes
the same shapes from streamed and non-streamed completions. Formats:

  Qwen / Hermes        <tool_call>{json}</tool_call>
  Qwen3.5 / Hermes     <function=name><parameter=k>v</parameter></function>
  Llama-3 built-in     <|python_tag|>NAME.call(k="v", ...)
  Llama-3 custom       <|python_tag|>{"name":..., "parameters":...}
  Llama-3.2 bare       {"name":..., "parameters":...}    (no marker)
  Mistral pre-v11      [TOOL_CALLS] [{"name":..., "arguments":...}, ...]
  Mistral v11+         [TOOL_CALLS]name{json}            (may chain)
  Ministral / Large 3  [TOOL_CALLS]name[ARGS]{json}
  Gemma 4              <|tool_call>call:NAME{k:<|"|>v<|"|>}<tool_call|>
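
As a rough illustration of what "normalised to OpenAI shape" means, the sketch below handles only the first format in the list (`<tool_call>{json}</tool_call>`). The function name, the regex, and the `call_` id scheme are assumptions made for this example and are not taken from `tool_call_parser.py`.

```python
import json
import re
import uuid

# Hypothetical pattern: the closing tag anchors the non-greedy match, so
# nested objects in the payload are still captured whole.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def parse_hermes_tool_calls(text: str) -> list[dict]:
    """Convert Qwen / Hermes style tool calls into OpenAI-style entries."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            payload = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # malformed payload: leave it as prose
        if not isinstance(payload, dict) or "name" not in payload:
            continue
        calls.append(
            {
                "id": f"call_{uuid.uuid4().hex[:8]}",  # assumed id format
                "type": "function",
                "function": {
                    "name": payload["name"],
                    # OpenAI clients expect arguments as a JSON string.
                    "arguments": json.dumps(payload.get("arguments", {})),
                },
            }
        )
    return calls


emission = 'Let me check. <tool_call>{"name":"search","arguments":{"q":"foo"}}</tool_call>'
tool_calls = parse_hermes_tool_calls(emission)
```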

The four parser bugs that motivated the revert are fixed here:

  1. Mistral nested-JSON truncation. The closed-pair Mistral regex
     `\[TOOL_CALLS\]...\{.*?\}` was non-greedy on `}`, so
     `[TOOL_CALLS]search{"filters":{"date":"2024"},"q":"foo"}` was
     stripped only up to the inner `}`, leaking `,"q":"foo"}` to
     the user. Replaced with `_strip_mistral_closed_calls` + the
     balanced-brace / balanced-bracket helpers that ignore braces
     inside JSON strings.

  2. `<|python_tag|>` stop-on-`<`. The route-layer strip clause
     `<\|python_tag\|>[^\n<]*` stopped at any literal `<`, so
     `<|python_tag|>python.call(code="if x < 10: pass")` was
     sliced to `< 10: pass")`. Replaced with
     `<\|python_tag\|>(?:[^<]|<(?!\|))*` so the strip consumes any
     character that is not a Llama-3 `<|sentinel|>` start --
     literal `<`, newlines, and embedded JSON all stay inside.

  3. Llama-3 sentinel single-pass loop. The fixed-order
     `for sentinel in (...)` loop in the bare-JSON parser silently
     dropped calls when the stream contained
     `<|eot_id|><|begin_of_text|>{json}` because `begin_of_text`
     was tested before `eot_id` consumed its prefix. Replaced with
     a `while True / matched` loop so the order of sentinels in
     the stream no longer matters.

  4. UTF-8 corruption in Llama-3 KV decoder.
     `bytes(s, "utf-8").decode("unicode_escape")` mangles non-ASCII
     bytes (`"café日本"` -> `'cafÃ©æ\x97¥æ\x9c¬'`). Replaced with
     `json.loads('"' + value + '"')` which handles
     `\n` / `\t` / `\uXXXX` escapes correctly while preserving
     literal UTF-8 bytes (emoji, CJK, etc.).
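
The ordering problem in item 3 can be shown with a toy sentinel stripper. The sentinel tuple and both function names below are assumptions for illustration; the real parser's sentinel list and loop body may differ.

```python
# Assumed sentinel set, tested in this fixed order by the single-pass version.
SENTINELS = ("<|begin_of_text|>", "<|eot_id|>", "<|eom_id|>")


def strip_sentinels_single_pass(text: str) -> str:
    """Old shape: each sentinel is tested exactly once, in tuple order."""
    for sentinel in SENTINELS:
        if text.startswith(sentinel):
            text = text[len(sentinel):]
    return text


def strip_sentinels_until_stable(text: str) -> str:
    """New shape: keep looping until a full pass matches nothing."""
    while True:
        matched = False
        for sentinel in SENTINELS:
            if text.startswith(sentinel):
                text = text[len(sentinel):]
                matched = True
        if not matched:
            return text


stream = '<|eot_id|><|begin_of_text|>{"name":"f","parameters":{}}'

single = strip_sentinels_single_pass(stream)
stable = strip_sentinels_until_stable(stream)
```

In the single-pass version `<|begin_of_text|>` is tested before `<|eot_id|>` has been removed, so it is left in front of the JSON and a bare-JSON check on the result would fail. The looping version reaches the `{` regardless of sentinel order.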

`_TOOL_XML_RE` keeps the orphan-handling clauses that #5735 added
for the speculative buffer leak shapes (closed pair OR orphan-open
to EOF, bare orphan close, tail-only `</parameter>`) so the route
layer continues to strip in-flight tool markup as well as the
multi-format closed pairs. The new `_strip_tool_xml(text)` helper
composes `_TOOL_XML_RE` with `_strip_mistral_closed_calls` so the
Mistral nested-JSON shape gets balanced-brace handling at every
call site (8 sites updated).

Capability gating in `_detect_safetensors_features` now allows
templates whose tool-call format is any of the seven supported
markers; the gate still suppresses `supports_tools` for templates
that advertise tools but use a shape the parser cannot honour, so
the UI never enables a pill the loop will not return.

Tests in scope:

  - tests/test_safetensors_tool_loop.py: full multi-format parser
    coverage (Qwen/Hermes, Llama-3 python_tag and bare JSON,
    Mistral all variants, Gemma 4), plus
    `TestRoutesPythonTagStrip` (8 tests) pinning the multi-line /
    less-than-in-code / sentinel-stop behaviour of bug 2's regex.
  - tests/test_safetensors_capability_advertise.py: capability
    gate keeps tools enabled for Llama-3 / Mistral / Gemma 4 /
    Llama-3.2 bare-JSON templates while still suppressing tools
    for unknown emission formats.

Tests deliberately out of scope (they belong to #5620 because they
exercise `llama_cpp.py` / `safetensors_agentic.py`):

  - TestLoopRePrompt (6) -- safetensors_agentic re-prompt loop.
  - TestLoopCanonicalHealKey (3) -- canonical heal key under loop.
  - TestGGUFSafetensorsHealingParity (5) -- GGUF / safetensors
    parity assertions on shared constants and `_MAX_REPROMPTS`.

`pytest studio/backend/tests/test_safetensors_tool_loop.py studio/backend/tests/test_safetensors_capability_advertise.py -q`
-> 104 passed.

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor


Code Review

This pull request expands the backend-neutral tool-call parser to support a wide range of model emission formats, including Llama-3, Llama-3.2 bare JSON, Mistral, and Gemma 4. It also updates the safetensors feature detection and streaming response stripping to handle these new formats, backed by comprehensive test coverage. A critical infinite loop vulnerability was identified in the Gemma 4 parser (_gemma_parse_value) when handling malformed list inputs, which could lead to a Denial of Service. A fix was suggested to ensure the index always progresses.

Comment on lines +767 to +774
    end = i
    while (
        end < len(text)
        and text[end] not in ",}]"
        and not text.startswith(_GEMMA_STR_BEGIN, end)
    ):
        end += 1
    raw = text[i:end].strip()

Contributor


security-critical critical

Infinite Loop Vulnerability in Gemma 4 Parser

There is a critical infinite loop vulnerability in _gemma_parse_value when parsing malformed or unexpected input inside list/array brackets.

If the input contains an unexpected closing character (like } or ]) at the start of a primitive value (e.g., [1} or [1, ]), the while loop condition text[end] not in ",}]" is immediately False. As a result, end remains equal to i, and _gemma_parse_value returns ("", i) without progressing the index.

When called inside the list parsing loop (while k < len(body):), this causes k to never increment, leading to an infinite loop that hangs the backend process at 100% CPU usage.

To prevent this, we must guarantee that _gemma_parse_value always consumes at least one character and progresses the index when i < len(text).

Suggested change

      end = i
      while (
          end < len(text)
          and text[end] not in ",}]"
          and not text.startswith(_GEMMA_STR_BEGIN, end)
      ):
          end += 1
    + if end == i and i < len(text):
    +     end = i + 1
      raw = text[i:end].strip()
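
The progress guarantee the review asks for can be exercised in isolation. The sketch below is a simplified stand-in, not the PR's `_gemma_parse_value`: `scan_primitive`, `scan_list_items`, and the `STR_BEGIN` constant are hypothetical names, and real Gemma string handling is left out.

```python
STR_BEGIN = '<|"|>'  # Gemma string delimiter as shown in the PR's format list


def scan_primitive(text: str, i: int) -> tuple[str, int]:
    """Scan one primitive starting at i; always advance when i < len(text)."""
    end = i
    while (
        end < len(text)
        and text[end] not in ",}]"
        and not text.startswith(STR_BEGIN, end)
    ):
        end += 1
    if end == i and i < len(text):
        end = i + 1  # consume the unexpected character instead of stalling
    return text[i:end].strip(), end


def scan_list_items(body: str) -> list[str]:
    """Collect primitives from a list body; terminates on malformed input."""
    items = []
    k = 0
    while k < len(body):
        raw, k = scan_primitive(body, k)
        if raw and raw not in {",", "}", "]"}:
            items.append(raw)
    return items


malformed_a = scan_list_items("1}")
malformed_b = scan_list_items("1, ]")
```

Without the `end == i` guard, the second call to `scan_primitive` on `"1}"` would return the same index it was given and the outer `while` would never exit.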

Three label corrections surfaced while cross-checking the parser
against official HF chat templates. Comment-only, no behaviour
change; 104/104 tests still pass.

  1. ``<function=...>`` XML was labelled "Qwen3.5 xml". The
     canonical emitter is Qwen3-Coder and it always wraps the
     <function=...> block inside an outer <tool_call>...</tool_call>,
     never bare. Updated labels in the module docstring and the
     three inline comments.

  2. Mistral grouping was wrong on two counts: Ministral-8B-2410
     uses Tekken V3 and emits the ``[TOOL_CALLS] [...]`` array
     form, not ``[ARGS]``. Mistral-Large-2411 ships
     tokenizer.model.v7 (no ``[ARGS]`` token) and also emits the
     array form. ``[ARGS]`` only enters with Tekken V13 (Devstral,
     Magistral-Small-2509). Regrouped the doc lines accordingly.

  3. Gemma 4 ``<|tool_call>`` is forward-looking. Neither
     gemma-3-12b-it nor gemma-3n-E4B-it emit this shape in their
     chat templates today. Noted that the capability gate
     correctly suppresses the tools pill on real Gemma 3
     templates, so the parser stays as ready-when-Google-ships
     scaffolding.
@danielhanchen

Member Author

Cross-checked this PR against the canonical tool-call parsers in llama.cpp (common/chat-parser.cpp + the legacy pre-PEG branch at 34df42f7be), vLLM (vllm/entrypoints/openai/tool_parsers/), SGLang (python/sglang/srt/function_call/), and the official chat templates of the families it claims to support on Hugging Face. Then ran an adversarial simulation suite head-to-head against pr-5615 to confirm the 4 documented regressions are gone.

Bug fixes confirmed independently

  1. Mistral nested-JSON truncation (_strip_mistral_closed_calls + _balanced_brace_end). Confirmed by: llama.cpp PEG balanced-scan parity, SGLang mistral_detector parity, simulation regression (truncates on pr-5615, full args preserved on this branch).
  2. <|python_tag|> strip stop-on-< ((?:[^<]|<(?!\|))*). Confirmed by: llama.cpp upstream handling; vLLM never had this bug because it doesn't pre-strip; simulation regression (route layer leaked < 10: pass") on pr-5615, fully stripped here).
  3. Llama-3 sentinel single-pass loop (while True / matched). Confirmed by: llama.cpp, SGLang llama32_detector (this PR is strictly stronger here, SGLang only checks text.startswith("{")), simulation (both orderings + triple-sentinel pile work on this branch; pr-5615 drops them).
  4. UTF-8 KV decoder (json.loads('"' + s + '"')). Confirmed by: llama.cpp uses proper JSON decoding, SGLang same; all 4 upstream sources confirm unicode_escape was the wrong tool. Simulation shows café日本🎉 preserved here, mangled to cafÃ©æ\x97¥æ\x9c¬ð\x9f\x8e\x89 on pr-5615.
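
The difference behind item 4 is easy to reproduce with the standard library alone. This is a standalone sketch of the two decoding approaches named above, applied to an example value chosen for illustration.

```python
import json

# A value as it might appear between quotes in a model emission:
# literal non-ASCII text followed by an escaped newline (backslash + n).
raw = "café日本\\n"

# Old approach: unicode_escape reads the UTF-8 bytes as Latin-1, so every
# multi-byte character is split into unrelated code points.
mangled = bytes(raw, "utf-8").decode("unicode_escape")

# New approach: let the JSON string grammar handle the escapes.
fixed = json.loads('"' + raw + '"')
```

Both approaches turn the escaped `\n` into a real newline, but only the `json.loads` route keeps `café日本` intact. One caveat of this sketch: `json.loads` rejects values that contain an unescaped double quote or a raw control character, so a caller needs a fallback for those.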

Custom regression suite (56 assertions across bug-fix + adversarial + capability + streaming cases): all pass on this branch, 5/8 fail on pr-5615 exactly where they should.

Parity per family

Each family is listed with its result against llama.cpp, vLLM, SGLang, and the HF template:

  • Qwen / Hermes <tool_call> -- llama.cpp: byte-perfect; vLLM: parity; SGLang: parity; HF template: parity (Qwen2.5, Qwen3, Hermes-3, Hermes-4).
  • Qwen3-Coder XML <function=...> -- llama.cpp: byte-perfect; vLLM: parity; SGLang: parity; HF template: parity (always nested in outer <tool_call>).
  • Llama-3 <|python_tag|> (builtin + custom) -- llama.cpp: byte-perfect; vLLM: parity; SGLang: this PR stronger (handles built-in NAME.call(...) form vLLM and SGLang miss); HF template: parity (Llama-3.1, Llama-3.3).
  • Llama-3.2 bare JSON -- llama.cpp: byte-perfect; vLLM: this PR stricter (false-positive-safe via strict guard); SGLang: this PR stronger (sentinel loop); HF template: parity (Llama-3.2).
  • Mistral V3 array / V11+ / V13 [ARGS] -- llama.cpp: byte-perfect; vLLM: parity; SGLang: parity; HF template: parity.
  • Gemma 4 <|tool_call>...<tool_call|> -- llama.cpp: byte-perfect on marker; vLLM: this PR preserves native types (vLLM force-casts to string); SGLang: parity; HF template: forward-looking, no shipping Gemma 3 model emits this shape yet, but the capability gate correctly suppresses the tools pill on real Gemma 3.

Docstring corrections pushed in 518d0a5570

Three label fixes surfaced by the HF-template cross-check (comment-only, no code change, 104/104 still pass):

  1. <function=name> XML was labelled Qwen3.5 xml. Canonical emitter is Qwen3-Coder and it always wraps in <tool_call>...</tool_call> (never bare). Relabelled.
  2. Mistral grouping: Ministral-8B-Instruct-2410 was incorrectly grouped with [TOOL_CALLS]name[ARGS]{json}. It ships Tekken V3 and emits the array form. Same for Mistral-Large-Instruct-2411 (ships tokenizer.model.v7, no [ARGS] token). [ARGS] only enters with Tekken V13 (Devstral, Magistral-Small-2509). Regrouped.
  3. Gemma 4 docstring now notes the parser is forward-looking scaffolding; no shipping Gemma 3 model uses this template today.

Architectural notes for reviewers

  • _parse_function_xml returns string-typed args. vLLM and SGLang's Qwen-coder parsers schema-coerce to ints / bools when the tool schema is known. Conscious trade-off because Studio's parser doesn't carry schema at parse time. Flag if any downstream consumer needs typed args.
  • Streaming partials are handled at the route layer via _MAX_BUFFER_CHARS=32 + the TOOL_XML_SIGNALS set (longest signal is 14 chars, well under cap). vLLM and SGLang use per-parser partial JSON state. Architectural fit for Studio's buffer-then-parse route is fine.
  • vLLM serialises args with ensure_ascii=False; this PR uses default (ensure_ascii=True). Doesn't affect parsing, only the string form of arguments going to the OpenAI client. Worth syncing in a follow-up if any client cares.
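
The `ensure_ascii` point in the last note is a serialisation detail only, which a short standard-library example shows. The argument dict is made up for illustration.

```python
import json

args = {"q": "café日本"}

# Default behaviour: non-ASCII characters become \uXXXX escapes.
escaped = json.dumps(args)

# vLLM-style behaviour: non-ASCII characters are emitted literally.
literal = json.dumps(args, ensure_ascii=False)

# Both strings decode back to the same Python object.
round_trip_equal = json.loads(escaped) == json.loads(literal) == args
```

A client that parses `arguments` as JSON sees no difference; only a client that displays or compares the raw string would notice.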

Follow-up parsers flagged by multiple upstream sources (deliberately out of scope for this PR)

Each format is listed with whether llama.cpp, vLLM, and SGLang support it:

  • Pythonic / list-of-calls (Llama-3.2 / Llama-4 [fn(k="v")]) -- llama.cpp: yes; vLLM: yes; SGLang: yes. Flagged 3x. Highest-value follow-up.
  • DeepSeek-V3 / V3.1 <|tool▁calls▁begin|>...<|tool▁calls▁end|> -- llama.cpp: yes; vLLM: yes; SGLang: yes. Flagged 3x. Next priority.
  • Granite <|tool_call|>[...] -- llama.cpp: yes; vLLM: yes; SGLang: partial. Lower priority.
  • Phi-4-mini functools[...], GLM-4 bare-name + JSON, Kimi K2, MiniMax-M2, Cohere Command-R -- llama.cpp: partial; vLLM: yes; SGLang: partial. Lower priority.

These are all gaps the original #5615 also did not cover, so leaving them out keeps this PR scoped to the same surface area #5615 already shipped.

Test gate

pytest studio/backend/tests/test_safetensors_tool_loop.py studio/backend/tests/test_safetensors_capability_advertise.py -q is green at 104 passed. That number is exactly #5620's 118 passing minus the 14 dropped parity tests (TestLoopRePrompt 6 + TestLoopCanonicalHealKey 3 + TestGGUFSafetensorsHealingParity 5), all of which still belong to #5620.

Ready to flip to ready-for-review once the Studio CI matrix is green and someone validates with real models, same gate #5620 holds on.

A fuzz pass turned up that ``_parse_llama3_bare_json`` accepted
``parameters`` as a string, contradicting the docstring's
"parameters or arguments is a dict" guard. Prose like
``{"name":"foo","parameters":"a sentence"}`` would wrongly fire
the parser, which the agentic loop would then heal into a real
``foo(query="a sentence")`` call.

Tightened guard:

  - ``parameters`` must be a dict (Llama-3 spec).
  - ``arguments`` may be a dict, or a JSON-encoded string that
    decodes to a dict (OpenAI shape, e.g.
    ``"arguments":"{\"q\":\"x\"}"``). Plain non-JSON strings,
    JSON-strings of lists / scalars / null no longer pass.
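
The tightened guard can be sketched as a small predicate. `extract_call_args` is a hypothetical name, and the real `_parse_llama3_bare_json` presumably does more (sentinel stripping, name validation); this only mirrors the two rules listed above.

```python
import json
from typing import Any, Optional


def extract_call_args(obj: Any) -> Optional[dict]:
    """Return the argument dict if obj passes the guard, else None."""
    if not isinstance(obj, dict) or not isinstance(obj.get("name"), str):
        return None
    if "parameters" in obj:
        # Llama-3 shape: parameters must already be a dict.
        params = obj["parameters"]
        return params if isinstance(params, dict) else None
    args = obj.get("arguments")
    if isinstance(args, dict):
        return args
    if isinstance(args, str):
        # OpenAI shape: a JSON-encoded string that decodes to a dict.
        try:
            decoded = json.loads(args)
        except json.JSONDecodeError:
            return None
        return decoded if isinstance(decoded, dict) else None
    return None


prose = json.loads('{"name":"foo","parameters":"a sentence"}')
openai_shape = json.loads('{"name":"foo","arguments":"{\\"q\\":\\"x\\"}"}')
```

With this guard `prose` is rejected, so it stays visible as text instead of being healed into a call, while `openai_shape` still yields `{"q": "x"}`.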

Added 4 regression tests under TestParserMultiFormat:

  - test_llama3_2_bare_json_string_parameters_does_not_fire
  - test_llama3_2_bare_json_string_arguments_not_json_does_not_fire
  - test_llama3_2_bare_json_string_arguments_json_dict_fires
  - test_llama3_2_bare_json_string_arguments_json_non_dict_does_not_fire

Existing tests stay green (104 -> 108 passing) and a 50-case
cross-version fuzz suite passes on Python 3.10 / 3.11 / 3.12 / 3.13.
@danielhanchen

Member Author

Cross-version fuzz pass on Python 3.10 / 3.11 / 3.12 / 3.13 in clean uv venvs. Pure-stdlib parser (json, re, typing), no platform branches, so the result is the same shape on Linux / macOS / Windows. The PR is server-side only (no browser code touched).

Fuzz suite (50 assertions across 4 venvs)

Loads tool_call_parser.py by file path so it runs with zero backend deps. Categories:

  • T1 bug-fix regressions (the 4 documented bugs)
  • T2 false-positive guards on bare JSON
  • T3 Unicode (NFD vs NFC, RTL, ZWJ emoji, surrogates, BOM, CRLF)
  • T4 catastrophic-backtracking probes (< runs, deep nesting, repeated triggers, unclosed tails -- all bounded under 1s)
  • T5 truncation at every byte position for 9 fixture shapes
  • T6 thread-safety (1000 parses across 16 threads, distinct expected names)
  • T7 strip idempotency (strip(strip(x)) == strip(x))
  • T8 1MB prose + 1MB-with-tail-call stress, under 2s
  • T9 multi-call chaining for every family
  • T10 mixed-family pollution (e.g. [TOOL_CALLS] literal inside Qwen arguments)
  • 1000-trial random fuzz + 500-trial marker-injection fuzz, no crashes, no hangs
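
Two of the categories above (T5 truncation and T7 strip idempotency) combine naturally into one loop. The sketch below runs them against a toy strip function rather than the real route-layer helper: `strip_tool_markup` and its regex are assumptions, and it truncates per character rather than per byte for simplicity.

```python
import re

# Toy strip: a closed <tool_call> pair, or an orphan open tag through to EOF.
TOOL_RE = re.compile(r"<tool_call>.*?(?:</tool_call>|\Z)", re.DOTALL)


def strip_tool_markup(text: str) -> str:
    return TOOL_RE.sub("", text)


fixture = 'Hello <tool_call>{"name":"f","arguments":{}}</tool_call> world'

checked = 0
for cut in range(len(fixture) + 1):
    prefix = fixture[:cut]          # simulate a stream cut off at this point
    once = strip_tool_markup(prefix)
    # Idempotency: stripping already-stripped text must change nothing.
    assert strip_tool_markup(once) == once
    # No partial payload may leak once the opening tag is complete.
    assert '{"name"' not in once
    checked += 1
```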

Result on every venv:

venv_3.10: 50 passed in 0.19s
venv_3.11: 50 passed in 0.15s
venv_3.12: 50 passed in 0.16s
venv_3.13: 50 passed in 0.16s

One real finding, fixed in 615b8608

_parse_llama3_bare_json accepted parameters as a string (e.g. {"name":"foo","parameters":"a sentence"}), contradicting the docstring's strict guard (parameters or arguments is a dict). Prose JSON would wrongly trigger; the agentic loop's heal step would then turn it into foo(query="a sentence").

Tightened guard:

  • parameters must be a dict (Llama-3 spec).
  • arguments may be a dict, or a JSON-encoded string that decodes to a dict (OpenAI shape, e.g. "arguments":"{\"q\":\"x\"}"). Plain non-JSON strings, and JSON-strings of lists / scalars / null no longer pass.

Added four regression tests under TestParserMultiFormat:

  • test_llama3_2_bare_json_string_parameters_does_not_fire
  • test_llama3_2_bare_json_string_arguments_not_json_does_not_fire
  • test_llama3_2_bare_json_string_arguments_json_dict_fires
  • test_llama3_2_bare_json_string_arguments_json_non_dict_does_not_fire

PR suite is 108 passing (was 104, +4 new).

Why merging this PR does not break old paths

  • Pure stdlib (json, re, typing); no hardware path, no UI / browser code, no platform branch.
  • Pre-existing Qwen / Hermes parsers are unchanged, just dispatched first so they still claim those inputs.
  • The route-layer _TOOL_XML_RE kept every clause that #5735 (Studio: strip orphan tool_call XML leaking into visible content) added for orphan-tool-call handling; the new shapes are added as new alternatives, not by changing the old ones.
  • Module-level state is regex / string / tuple constants only; safe to share across threads (T6 confirms).
  • The capability gate still suppresses supports_tools for templates whose format the parser cannot honour, so a model running an unrecognised template never gets a tools pill it cannot fulfil.

Reports + fuzz tests landed at:

  • temp/sim/fuzz_parser.py (50 assertions, no backend deps)
  • temp/sim/venv_3.10..3.13/ (clean uv venvs)
  • async_task_outputs/verify_5811_{llamacpp,vllm,sglang,hf_templates,simulation}.md
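
The fuzz suite is described above as loading tool_call_parser.py by file path so that no backend package import is needed. A minimal way to do that with `importlib` is sketched below; the helper name is hypothetical, and a temporary toy module stands in for the real parser file so the example runs on its own.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_module_from_path(name: str, path: Path):
    """Import a single .py file as a module without touching its package."""
    spec = importlib.util.spec_from_file_location(name, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load {path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


# Stand-in for the real parser file, written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    toy = Path(tmp) / "toy_parser.py"
    toy.write_text("MARKER = '<tool_call>'\n", encoding="utf-8")
    loaded = load_module_from_path("toy_parser", toy)
```

This works for a file whose imports are all standard library, which matches the pure-stdlib description of the parser given earlier.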

rhsCZ pushed a commit to rhsCZ/unsloth that referenced this pull request May 27, 2026
A fuzz pass on PR unslothai#5811 turned up that ``_parse_llama3_bare_json``
accepted ``parameters`` as a string, contradicting the docstring's
"parameters or arguments is a dict" guard. Prose JSON like
``{"name":"foo","parameters":"a sentence"}`` would wrongly fire the
parser, which the agentic loop would then heal into a real
``foo(query="a sentence")`` call.

Same code lives on this branch, so the same fix applies here.

Tightened guard:

  - ``parameters`` must be a dict (Llama-3 spec).
  - ``arguments`` may be a dict, or a JSON-encoded string that
    decodes to a dict (OpenAI shape, e.g.
    ``"arguments":"{\"q\":\"x\"}"``). Plain non-JSON strings or
    JSON-strings of lists / scalars / null no longer pass.

Mirrors the fix landed in PR unslothai#5811 commit 615b860. Adds the same
4 regression tests under TestParserMultiFormat.

Existing test suite stays green: 127 -> 131 passing.