Studio: re-introduce multi-format tool calling with parser bug fixes#5811
Studio: re-introduce multi-format tool calling with parser bug fixes#5811danielhanchen wants to merge 4 commits into
Conversation
Resubmits the work from #5615 (reverted in #5619) with the parser / route bug fixes that were subsequently developed on #5620 folded in. The healing-parity package in #5620 -- the GGUF canonical heal key in `llama_cpp.py` and the safetensors_agentic re-prompt loop -- is deliberately left to #5620 so this redo stays scoped to the same 4 files that #5615 originally touched. Adds multi-format tool-call parsing for the safetensors / MLX agentic loop so Llama-3, Llama-3.2 bare JSON, Mistral pre-v11 / v11+ / Ministral, and Gemma 4 tool emissions are normalised to OpenAI shape instead of leaking as prose, plus a route-layer strip that removes the same shapes from streamed and non-streamed completions. Formats: Qwen / Hermes <tool_call>{json}</tool_call> Qwen3.5 / Hermes <function=name><parameter=k>v</parameter></function> Llama-3 built-in <|python_tag|>NAME.call(k="v", ...) Llama-3 custom <|python_tag|>{"name":..., "parameters":...} Llama-3.2 bare {"name":..., "parameters":...} (no marker) Mistral pre-v11 [TOOL_CALLS] [{"name":..., "arguments":...}, ...] Mistral v11+ [TOOL_CALLS]name{json} (may chain) Ministral / Large 3 [TOOL_CALLS]name[ARGS]{json} Gemma 4 <|tool_call>call:NAME{k:<|"|>v<|"|>}<tool_call|> The four parser bugs that motivated the revert are fixed here: 1. Mistral nested-JSON truncation. The closed-pair Mistral regex `\[TOOL_CALLS\]...\{.*?\}` was non-greedy on `}`, so `[TOOL_CALLS]search{"filters":{"date":"2024"},"q":"foo"}` was stripped only up to the inner `}`, leaking `,"q":"foo"}` to the user. Replaced with `_strip_mistral_closed_calls` + the balanced-brace / balanced-bracket helpers that ignore braces inside JSON strings. 2. `<|python_tag|>` stop-on-`<`. The route-layer strip clause `<\|python_tag\|>[^\n<]*` stopped at any literal `<`, so `<|python_tag|>python.call(code="if x < 10: pass")` was sliced to `< 10: pass")`. Replaced with `<\|python_tag\|>(?:[^<]|<(?!\|))*` so the strip consumes any character that is not a Llama-3 `<|sentinel|>` start -- literal `<`, newlines, and embedded JSON all stay inside. 3. Llama-3 sentinel single-pass loop. The fixed-order `for sentinel in (...)` loop in the bare-JSON parser silently dropped calls when the stream contained `<|eot_id|><|begin_of_text|>{json}` because `begin_of_text` was tested before `eot_id` consumed its prefix. Replaced with a `while True / matched` loop so the order of sentinels in the stream no longer matters. 4. UTF-8 corruption in Llama-3 KV decoder. `bytes(s, "utf-8").decode("unicode_escape")` mangles non-ASCII bytes (`"café日本"` -> `'caféæ\x97¥æ\x9c¬'`). Replaced with `json.loads('"' + value + '"')` which handles `\n` / `\t` / `\uXXXX` escapes correctly while preserving literal UTF-8 bytes (emoji, CJK, etc.). `_TOOL_XML_RE` keeps the orphan-handling clauses that #5735 added for the speculative buffer leak shapes (closed pair OR orphan-open to EOF, bare orphan close, tail-only `</parameter>`) so the route layer continues to strip in-flight tool markup as well as the multi-format closed pairs. The new `_strip_tool_xml(text)` helper composes `_TOOL_XML_RE` with `_strip_mistral_closed_calls` so the Mistral nested-JSON shape gets balanced-brace handling at every call site (8 sites updated). Capability gating in `_detect_safetensors_features` now allows templates whose tool-call format is any of the seven supported markers; the gate still suppresses `supports_tools` for templates that advertise tools but use a shape the parser cannot honour, so the UI never enables a pill the loop will not return. Tests in scope: - tests/test_safetensors_tool_loop.py: full multi-format parser coverage (Qwen/Hermes, Llama-3 python_tag and bare JSON, Mistral all variants, Gemma 4), plus `TestRoutesPythonTagStrip` (8 tests) pinning the multi-line / less-than-in-code / sentinel-stop behaviour of bug 2's regex. - tests/test_safetensors_capability_advertise.py: capability gate keeps tools enabled for Llama-3 / Mistral / Gemma 4 / Llama-3.2 bare-JSON templates while still suppressing tools for unknown emission formats. Tests deliberately out of scope (they belong to #5620 because they exercise `llama_cpp.py` / `safetensors_agentic.py`): - TestLoopRePrompt (6) -- safetensors_agentic re-prompt loop. - TestLoopCanonicalHealKey (3) -- canonical heal key under loop. - TestGGUFSafetensorsHealingParity (5) -- GGUF / safetensors parity assertions on shared constants and `_MAX_REPROMPTS`. `pytest studio/backend/tests/test_safetensors_tool_loop.py studio/backend/tests/test_safetensors_capability_advertise.py -q` -> 104 passed.
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
Code Review
This pull request expands the backend-neutral tool-call parser to support a wide range of model emission formats, including Llama-3, Llama-3.2 bare JSON, Mistral, and Gemma 4. It also updates the safetensors feature detection and streaming response stripping to handle these new formats, backed by comprehensive test coverage. A critical infinite loop vulnerability was identified in the Gemma 4 parser (_gemma_parse_value) when handling malformed list inputs, which could lead to a Denial of Service. A fix was suggested to ensure the index always progresses.
| end = i | ||
| while ( | ||
| end < len(text) | ||
| and text[end] not in ",}]" | ||
| and not text.startswith(_GEMMA_STR_BEGIN, end) | ||
| ): | ||
| end += 1 | ||
| raw = text[i:end].strip() |
There was a problem hiding this comment.
Infinite Loop Vulnerability in Gemma 4 Parser
There is a critical infinite loop vulnerability in _gemma_parse_value when parsing malformed or unexpected input inside list/array brackets.
If the input contains an unexpected closing character (like } or ]) at the start of a primitive value (e.g., [1} or [1, ]), the while loop condition text[end] not in ",}]" is immediately False. As a result, end remains equal to i, and _gemma_parse_value returns ("", i) without progressing the index.
When called inside the list parsing loop (while k < len(body):), this causes k to never increment, leading to an infinite loop that hangs the backend process at 100% CPU usage.
To prevent this, we must guarantee that _gemma_parse_value always consumes at least one character and progresses the index when i < len(text).
| end = i | |
| while ( | |
| end < len(text) | |
| and text[end] not in ",}]" | |
| and not text.startswith(_GEMMA_STR_BEGIN, end) | |
| ): | |
| end += 1 | |
| raw = text[i:end].strip() | |
| end = i | |
| while ( | |
| end < len(text) | |
| and text[end] not in ",}]" | |
| and not text.startswith(_GEMMA_STR_BEGIN, end) | |
| ): | |
| end += 1 | |
| if end == i and i < len(text): | |
| end = i + 1 | |
| raw = text[i:end].strip() |
Three label corrections surfaced while cross-checking the parser
against official HF chat templates. Comment-only, no behaviour
change; 104/104 tests still pass.
1. ``<function=...>`` XML was labelled "Qwen3.5 xml". The
canonical emitter is Qwen3-Coder and it always wraps the
<function=...> block inside an outer <tool_call>...</tool_call>,
never bare. Updated labels in the module docstring and the
three inline comments.
2. Mistral grouping was wrong on two counts: Ministral-8B-2410
uses Tekken V3 and emits the ``[TOOL_CALLS] [...]`` array
form, not ``[ARGS]``. Mistral-Large-2411 ships
tokenizer.model.v7 (no ``[ARGS]`` token) and also emits the
array form. ``[ARGS]`` only enters with Tekken V13 (Devstral,
Magistral-Small-2509). Regrouped the doc lines accordingly.
3. Gemma 4 ``<|tool_call>`` is forward-looking. Neither
gemma-3-12b-it nor gemma-3n-E4B-it emit this shape in their
chat templates today. Noted that the capability gate
correctly suppresses the tools pill on real Gemma 3
templates, so the parser stays as ready-when-Google-ships
scaffolding.
|
Cross-checked this PR against the canonical tool-call parsers in llama.cpp ( Bug fixes confirmed independently
Custom regression suite (56 assertions across bug-fix + adversarial + capability + streaming cases): all pass on this branch, 5/8 fail on Parity per family
Docstring corrections pushed in
|
| Format | llama.cpp | vLLM | SGLang | Notes |
|---|---|---|---|---|
Pythonic / list-of-calls (Llama-3.2 / Llama-4 [fn(k=\"v\")]) |
yes | yes | yes | Flagged 3x. Highest-value follow-up. |
DeepSeek-V3 / V3.1 <|tool▁calls▁begin|>...<|tool▁calls▁end|> |
yes | yes | yes | Flagged 3x. Next priority. |
Granite <|tool_call|>[...] |
yes | yes | partial | Lower priority. |
Phi-4-mini functools[...], GLM-4 bare-name + JSON, Kimi K2, MiniMax-M2, Cohere Command-R |
partial | yes | partial | Lower priority. |
These are all gaps the original #5615 also did not cover, so leaving them out keeps this PR scoped to the same surface area #5615 already shipped.
Test gate
pytest studio/backend/tests/test_safetensors_tool_loop.py studio/backend/tests/test_safetensors_capability_advertise.py -q is green at 104 passed. That number is exactly #5620's 118 passing minus the 14 dropped parity tests (TestLoopRePrompt 6 + TestLoopCanonicalHealKey 3 + TestGGUFSafetensorsHealingParity 5), all of which still belong to #5620.
Ready to flip to ready-for-review once the Studio CI matrix is green and someone validates with real models, same gate #5620 holds on.
A fuzz pass turned up that ``_parse_llama3_bare_json`` accepted
``parameters`` as a string, contradicting the docstring's
"parameters or arguments is a dict" guard. Prose like
``{"name":"foo","parameters":"a sentence"}`` would wrongly fire
the parser, which the agentic loop would then heal into a real
``foo(query="a sentence")`` call.
Tightened guard:
- ``parameters`` must be a dict (Llama-3 spec).
- ``arguments`` may be a dict, or a JSON-encoded string that
decodes to a dict (OpenAI shape, e.g.
``"arguments":"{\"q\":\"x\"}"``). Plain non-JSON strings,
JSON-strings of lists / scalars / null no longer pass.
Added 4 regression tests under TestParserMultiFormat:
- test_llama3_2_bare_json_string_parameters_does_not_fire
- test_llama3_2_bare_json_string_arguments_not_json_does_not_fire
- test_llama3_2_bare_json_string_arguments_json_dict_fires
- test_llama3_2_bare_json_string_arguments_json_non_dict_does_not_fire
Existing tests stay green (104 -> 108 passing) and a 50-case
cross-version fuzz suite passes on Python 3.10 / 3.11 / 3.12 / 3.13.
|
Cross-version fuzz pass on Python 3.10 / 3.11 / 3.12 / 3.13 in clean uv venvs. Pure-stdlib parser ( Fuzz suite (50 assertions across 4 venvs)Loads
Result on every venv: One real finding, fixed in
|
A fuzz pass on PR unslothai#5811 turned up that ``_parse_llama3_bare_json`` accepted ``parameters`` as a string, contradicting the docstring's "parameters or arguments is a dict" guard. Prose JSON like ``{"name":"foo","parameters":"a sentence"}`` would wrongly fire the parser, which the agentic loop would then heal into a real ``foo(query="a sentence")`` call. Same code lives on this branch, so the same fix applies here. Tightened guard: - ``parameters`` must be a dict (Llama-3 spec). - ``arguments`` may be a dict, or a JSON-encoded string that decodes to a dict (OpenAI shape, e.g. ``"arguments":"{\"q\":\"x\"}"``). Plain non-JSON strings or JSON-strings of lists / scalars / null no longer pass. Mirrors the fix landed in PR unslothai#5811 commit 615b860. Adds the same 4 regression tests under TestParserMultiFormat. Existing test suite stays green: 127 -> 131 passing.
What this PR does
Resubmits the work from #5615 (reverted in #5619) with the four parser / route bug fixes developed on #5620 folded in. Same 4-file scope as the original #5615 (
tool_call_parser.py,routes/inference.py, two test files). The healing-parity package in #5620 -- the GGUF canonical heal key inllama_cpp.pyand the safetensors_agentic re-prompt loop -- is deliberately left to #5620 so this redo stays scoped to the same files #5615 touched.Why a redo and not just merging #5620
#5620 is open as DRAFT, holding while real-model end-to-end validation finishes. Splitting the parser bug fixes out into this PR lets the four documented #5615 regressions land sooner without waiting on the healing-parity validation that #5620 is gated on. Once both pieces are in, the studio tool-call behaviour matches what llama-server already normalises for GGUF.
Bug fixes folded in from #5620
The four parser bugs that prompted the #5615 revert are all fixed here:
tool_call_parser.pyMistral closed-pair patterns\[TOOL_CALLS\]...\{.*?\}is non-greedy on}, truncating nested JSON ([TOOL_CALLS]search{"filters":{"date":"2024"},"q":"foo"}leaked,"q":"foo"})_strip_mistral_closed_calls+_balanced_brace_end/_balanced_bracket_endhelpers that ignore braces inside JSON stringsroutes/inference.py_TOOL_XML_RE<|python_tag|>[^\n<]*stops at literal<(<|python_tag|>python.call(code="if x < 10: pass")sliced to< 10: pass"))<|python_tag|>(?:[^<]|<(?!|))*-- consumes any character that is not a Llama-3<|sentinel|>start, so literal<, newlines, and embedded JSON stay inside the striptool_call_parser.pyLlama-3 sentinel loopfor sentinel in (...)silently dropped calls when the stream had<|eot_id|><|begin_of_text|>{json}becausebegin_of_textwas tested beforeeot_idconsumed its prefixwhile True / matchedloop -- order of sentinels in the stream no longer matterstool_call_parser.pyLlama-3 KV decoderbytes(s, "utf-8").decode("unicode_escape")mangles non-ASCII ("café日本"->'caféæ\x97¥æ\x9c¬')json.loads('"' + value + '"')-- handles\n/\t/\uXXXXescapes correctly while preserving literal UTF-8 bytes_TOOL_XML_REkeeps the orphan-handling clauses that #5735 added for the speculative-buffer leak shapes (closed pair OR orphan-open to EOF, bare orphan close, tail-only</parameter>). The new_strip_tool_xml(text)helper composes_TOOL_XML_REwith_strip_mistral_closed_callsat the 8 route-layer call sites so the Mistral nested-JSON shape gets balanced-brace handling everywhere.Out of scope -- still belongs to #5620
These five concerns are NOT included here and stay reserved for #5620:
llama_cpp.py({"query": raw_args}-> per-tool dict lookup) -- bug 5 from the studio: tool calling for Llama-3, Mistral, Gemma 4 on safetensors + MLX #5615 post-mortem.safetensors_agentic.pyre-prompt loop (_MAX_REPROMPTS = 3) -- nudges the model when it emits intent without calling a tool.TestLoopRePrompt(6 tests) -- exercises the re-prompt loop.TestLoopCanonicalHealKey(3 tests) -- exercises the canonical heal key under the loop.TestGGUFSafetensorsHealingParity(5 tests) -- pins shared constants between GGUF and safetensors paths.Capability gating
_detect_safetensors_featuresallows templates whose tool-call format is any of the seven supported markers. The gate still suppressessupports_toolsfor templates that advertise tools but use a shape the parser cannot honour, so the UI never enables a pill the loop will not return.Tests
pytest studio/backend/tests/test_safetensors_tool_loop.py studio/backend/tests/test_safetensors_capability_advertise.py -q-> 104 passed.That number is exactly
#5620's 118 passing - 14 dropped tests(TestLoopRePrompt6 +TestLoopCanonicalHealKey3 +TestGGUFSafetensorsHealingParity5).Coverage matrix:
TestParser/TestParserMultiFormat-- parser shape coverage for all seven formats (Qwen/Hermes, Llama-3 python_tag and bare JSON, Mistral pre-v11 / v11+ / Ministral, Gemma 4). These tests are what catch regressions of bugs 1, 3, 4 -- bugs 1 and 3 surface as wrong parse output on representative inputs, bug 4 surfaces as garbled non-ASCII argument values.TestRoutesPythonTagStrip(8 tests) -- pins bug 2's regex on single-line, less-than-in-code, multi-line, multi-line-with-less-than, EOM sentinel stop, EOT sentinel stop, JSON-form multi-line, and "EOM then trailing python_tag" cases.TestLoopBasic/TestLoopBehaviour/TestLoopControl/TestStatusFormatting/TestProseMentioningToolCall/TestChatTemplateHelper/TestGuardrails/TestGptOssNameDetection-- existing safetensors loop coverage.test_safetensors_capability_advertise.py-- capability gate keeps tools enabled for Llama-3 / Mistral / Gemma 4 / Llama-3.2 bare-JSON templates while still suppressing tools for unknown emission formats.Draft status
Open as draft to mirror #5620's caution. Flip to ready after the CI matrix is green and someone validates with real models (
studio-backend-ci,studio-api-smoke,studio-inference-smoke, the mac / windows variants,mlx-ci,consolidated-tests-ci,lint-ci). This is the same validation gate that #5620 is currently holding on.Refs: #5615 (reverted) , #5619 (the revert) , #5620 (open draft, source of the bug fixes folded in here).