studio: tool calling for DeepSeek (R1/V3/V3.1), GLM 4.x, Kimi K2 on safetensors + MLX#5624
danielhanchen wants to merge 19 commits into
Adds three more emission-family parsers to tool_call_parser.py so the
shared safetensors / MLX / GGUF agentic loop covers the major open-
weight reasoning families. Patterns ported from llama.cpp
(common/chat-parser.cpp legacy pre-PEG branch), vLLM
(tool_parsers/deepseekv3*, glm4_moe, kimi_k2), and SGLang
(function_call/deepseekv31_detector, glm4_moe_detector, kimik2_detector).
All three references are MIT (llama.cpp) or Apache-2.0 (vLLM, SGLang).
Formats covered:
  DeepSeek R1       <|tool▁calls▁begin|><|tool▁call▁begin|>function
                    <|tool▁sep|>NAME\n```json\n{...}\n```<|tool▁call▁end|>
                    <|tool▁calls▁end|>
                    -- args wrapped in a Markdown json fence, ``function``
                       literal prefix per llama.cpp
                       common_chat_parse_deepseek_r1
                       (chat-parser.cpp:801-820)
  DeepSeek V3/V3.1  <|tool▁calls▁begin|><|tool▁call▁begin|>NAME
                    <|tool▁sep|>{json}<|tool▁call▁end|><|tool▁calls▁end|>
                    -- bare JSON, no code fence, no ``function`` prefix
                       per llama.cpp common_chat_parse_deepseek_v3_1
                       (chat-parser.cpp:822-879)
  GLM 4.5/4.6/4.7   <tool_call>NAME\n<arg_key>k1</arg_key>
                    \n<arg_value>v1</arg_value>...</tool_call>
                    -- strings raw, non-strings JSON-encoded per
                       chat_template.jinja; multi-call is back-to-back
                       blocks. Per llama.cpp common_chat_parse_glm_4_5
                       (chat-parser.cpp:1040-1052)
  Kimi K2           <|tool_calls_section_begin|><|tool_call_begin|>
                    functions.NAME:IDX<|tool_call_argument_begin|>{json}
                    <|tool_call_end|><|tool_calls_section_end|>
                    -- bare name recovered by stripping ``functions.``
                       prefix and ``:IDX`` suffix; full id preserved as
                       tool_calls[i].id so the roundtrip replays verbatim.
                       Per llama.cpp common_chat_parse_kimi_k2
                       (chat-parser.cpp:896-913)
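As an illustration of the Kimi K2 row, the following is a minimal sketch of recovering the bare name from a ``functions.NAME:IDX`` id while keeping the full id. The regex and the ``split_kimi_id`` helper are hypothetical names for this example, not the identifiers used in tool_call_parser.py.

```python
import re

# Hypothetical pattern for "functions.NAME:IDX"; NAME may itself contain dots.
_KIMI_ID_RE = re.compile(r"^functions\.(?P<name>.+):(?P<idx>\d+)$")


def split_kimi_id(full_id: str):
    """Return (bare_name, full_id), or None if the id does not fit the shape."""
    full_id = full_id.strip()
    m = _KIMI_ID_RE.match(full_id)
    if m is None:
        return None
    # The bare name is what gets dispatched; the full id is kept for replay.
    return m.group("name"), full_id


print(split_kimi_id("functions.get_weather:0"))
print(split_kimi_id("functions.mcp.search:12"))
print(split_kimi_id("3"))
```

Anchoring the ``:IDX`` suffix at the end of the string is what lets a dotted tool name survive intact in this sketch.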
Marker collisions
GLM uses the same ``<tool_call>`` opener as Qwen but with a bare
function name + ``<arg_key>`` body (Qwen has ``\s*{`` after the tag).
The dispatch keeps Qwen first; Qwen's _TC_JSON_START_RE returns no
matches on a GLM emission, so the fall-through to
_parse_glm_tool_calls handles it correctly. Existing Qwen tests
confirm zero regression.
Streaming buffer
TOOL_XML_SIGNALS extended from 5 markers to 12 so the BUFFERING state
machine wakes on every new family's section opener. Added the
DeepSeek alternative markers (ASCII underscores, short ``<|tool▁calls|>``
form) because real checkpoints emit those variants.
Strip patterns
_TOOL_CLOSED_PATS adds DeepSeek envelope (``<|tool▁calls▁begin|>...
<|tool▁calls▁end|>``) and Kimi section (``<|tool_calls_section_begin|>
...<|tool_calls_section_end|>``). _TOOL_ALL_PATS adds the same plus
the unclosed-tail variants so a truncated stream does not leak
markup.
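A minimal sketch of the closed-pair plus unclosed-tail idea, using the DeepSeek envelope only. The two compiled patterns and ``strip_deepseek_markup`` are assumptions made for this example; the actual entries in _TOOL_CLOSED_PATS and _TOOL_ALL_PATS may be written differently.

```python
import re

DS_BEGIN = "<|tool▁calls▁begin|>"
DS_END = "<|tool▁calls▁end|>"

# Closed envelope: opener, anything (lazily), closer.
_CLOSED = re.compile(re.escape(DS_BEGIN) + r".*?" + re.escape(DS_END), re.DOTALL)
# Unclosed tail: opener with no closer, running to the end of the text.
_UNCLOSED_TAIL = re.compile(re.escape(DS_BEGIN) + r".*\Z", re.DOTALL)


def strip_deepseek_markup(text: str) -> str:
    """Remove complete envelopes first, then any truncated trailing envelope."""
    text = _CLOSED.sub("", text)
    return _UNCLOSED_TAIL.sub("", text)


closed = "Hi " + DS_BEGIN + "payload" + DS_END + " there"
truncated = "Answer: " + DS_BEGIN + "<|tool▁call▁begin|>fn"
print(repr(strip_deepseek_markup(closed)))
print(repr(strip_deepseek_markup(truncated)))
```

Running the closed pattern before the tail pattern matters here: otherwise the tail pattern would swallow any visible text that follows a complete envelope.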
Route gate
_detect_safetensors_features._PARSER_MARKERS grows to include
DeepSeek and Kimi markers plus ``<arg_key>`` (the unique GLM signal).
_TOOL_XML_RE (the route-layer markup-strip regex) gets DeepSeek and
Kimi closed-pair patterns. _TOOL_TEMPLATE_MARKERS in llama_cpp.py
adds ``message['role'] == 'tool'``, ``message['tool_calls']``, and
``tool_calls is defined`` so the classifier recognises DeepSeek's
subscripted-access template style (it has no top-level
``{% if tools %}`` block).
Tests (39 new):
  TestParserDeepSeek (7)            -- R1 fence, short-form opener, V3.1 bare,
                                       multi-call, with-reasoning, strip,
                                       signal-wakes-streaming
  TestParserGLM (6)                 -- single, mixed types, multi-call,
                                       unclosed-heal, no-Qwen-regression, strip
  TestParserKimi (6)                -- single, multi-call, dotted-name, unclosed,
                                       strip, signal-wakes-streaming
  TestParserCrossFormatRouting (2)  -- dispatch routing, signal coverage
  TestLoopBasic loop integration (3) -- DeepSeek / GLM / Kimi end-to-end
  Capability advertise (3)          -- DeepSeek / GLM / Kimi templates flip
                                       supports_tools=True
All 398 targeted tests pass locally (115 safetensors + 27 capability
+ rest of tool / inference / sandbox / model-config suites). Builds
on PR #5620 (parser + healing parity for Llama-3 / Mistral / Gemma 4);
will rebase cleanly onto main once #5620 lands. PR opened as draft -
do not merge until validated against real models for each family.
Sources
- llama.cpp common/chat-parser.cpp lines 801-913, 1040-1052 (MIT)
- vLLM vllm/tool_parsers/deepseekv31_tool_parser.py (Apache-2.0)
- vLLM vllm/tool_parsers/glm4_moe_tool_parser.py (Apache-2.0)
- vLLM vllm/tool_parsers/kimi_k2_tool_parser.py (Apache-2.0)
- SGLang python/sglang/srt/function_call/
  {deepseekv31,glm4_moe,kimik2}_detector.py (Apache-2.0)
- Live chat templates: deepseek-ai/DeepSeek-V3.1, zai-org/GLM-4.6,
moonshotai/Kimi-K2-Instruct, unsloth/DeepSeek-V3-0324,
unsloth/GLM-4.5-Air, unsloth/Kimi-K2-Instruct
Mirrors PR unslothai#5624: three more emission-family parsers for the
shared tool_call_parser plus CI updates that exercise the new fixtures
cross-OS.
- DeepSeek R1 / V3 / V3.1: <|tool▁calls▁begin|>...<|tool▁sep|>...
- GLM 4.5 / 4.6 / 4.7:
  <tool_call>NAME\n<arg_key>K</arg_key>\n<arg_value>V</arg_value>...</tool_call>
- Kimi K2 / Moonshot:
  <|tool_calls_section_begin|>...<|tool_call_argument_begin|>...
Ported from llama.cpp common/chat-parser.cpp lines 801-913, 1040-1052
(MIT), vLLM tool_parsers/{deepseekv31,glm4_moe,kimi_k2}_tool_parser.py
(Apache-2.0), and SGLang
function_call/{deepseekv31,glm4_moe,kimik2}_detector.py (Apache-2.0).
CI multi-format probe extended from 9 to 13 fixtures so all four new
families run on ubuntu / macos-14 / windows.
Code Review
This pull request adds tool call parsing support for DeepSeek (R1, V3, V3.1), GLM (4.5, 4.6, 4.7), and Kimi K2 models. The implementation includes new regex patterns, specialized parsing functions, and updates to the inference routing logic and test suites. Reviewers suggested making the DeepSeek R1 parser more robust by handling leading whitespace before JSON blocks and recommended moving the ast import to the top level to avoid performance overhead during streaming inference.
    json_start = m.end()
    # Walk a balanced ``{`` even if the trailing fence is truncated.
    if json_start >= len(body) or body[json_start] != "{":
The DeepSeek R1 path should skip any leading whitespace between the Markdown code fence and the start of the JSON object, similar to the V3 and Kimi parsers. This makes the parser more robust against variations in model output formatting.
Suggested change:
-   json_start = m.end()
-   # Walk a balanced ``{`` even if the trailing fence is truncated.
-   if json_start >= len(body) or body[json_start] != "{":
+   json_start = m.end()
+   # Skip any whitespace before the JSON.
+   while json_start < len(body) and body[json_start] in " \t\n\r":
+       json_start += 1
+   # Walk a balanced { even if the trailing fence is truncated.
+   if json_start >= len(body) or body[json_start] != "{":
    except (json.JSONDecodeError, ValueError):
        pass
    try:
        import ast as _ast
Four fixes addressing review of the parent commit:
1. GLM <arg_value> coercion: tighten the json.loads -> ast.literal_eval
   -> raw cascade to only deserialize when the body unambiguously looks
   like a JSON literal (object, array, JSON-encoded string,
   true/false/null, or numeric). Strings like ``True`` / ``None``
   (Python literals, not JSON) and arbitrary prose now stay raw. The
   bare-numeric / bare-boolean ambiguity with string args remains an
   inherent limitation of the template without schema access --
   documented in the new comment. Drops the ast import entirely (closes
   Gemini's :1036 suggestion).
2. Kimi K2 bare-counter ids (e.g. ``<|tool_call_begin|>3``) are now
   dropped rather than surfaced as a tool literally named "3". Matches
   vLLM behaviour; SGLang's schema-infer fallback is out of scope at the
   parse site. Real Kimi K2 emissions use ``functions.NAME:IDX`` so this
   is the exception path.
3. Restore the elaborate ``<|python_tag|>(?:[^<]|<(?!\|))*`` clause in
   routes.inference._TOOL_XML_RE -- the simpler ``[^\n<]*`` form
   regressed PR #5620's multi-line / literal-``<`` python_tag fix.
   Restore ``TestRoutesPythonTagStrip`` (8 tests) adapted to call
   ``_TOOL_XML_RE.sub`` directly since the ``_strip_tool_xml`` helper
   was inlined this PR.
4. Add the spaced and backslash-escaped DeepSeek opener variants
   (``<|tool calls begin|>``, ``<|tool\_calls\_begin|>``) to
   ``TOOL_XML_SIGNALS`` for streaming-gate parity with
   ``_DEEPSEEK_BEGIN_RE``.
Also updates the llama.cpp / vLLM citations in the parser docstrings:
``common/chat-parser.cpp`` was split into ``common/chat.cpp`` +
``common/chat-peg-parser.cpp`` by llama.cpp PR #18675, and vLLM moved
the tool parsers from ``vllm/entrypoints/openai/tool_parsers/`` to
``vllm/tool_parsers/``. Pin to pre-refactor commit ``51fa458a92d6``
where the cited line numbers still resolve.
New regression tests in ``test_pr5624_regressions.py`` cover the GLM
coercion heuristic shapes, GLM literal-``<`` in arg_value, Kimi K2
dotted name, Kimi K2 bare-counter drop, DeepSeek V3.1 truncated
mid-stream, and routes-layer strip across all three new families.
Tests:
pytest studio/backend/tests/test_safetensors_tool_loop.py
studio/backend/tests/test_safetensors_capability_advertise.py
studio/backend/tests/test_pr5624_regressions.py -q
-> 170 passed in 1.91s
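The GLM coercion rule in fix 1 can be sketched roughly as follows. This is an illustration under assumptions: ``coerce_arg_value`` and ``_JSON_NUMBER_RE`` are hypothetical names, and the real heuristic may probe different shapes.

```python
import json
import re

# JSON number grammar: optional minus, integer part, optional fraction/exponent.
_JSON_NUMBER_RE = re.compile(r"-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?\Z")


def coerce_arg_value(raw: str):
    """Decode raw only when it unambiguously looks like a JSON literal."""
    probe = raw.strip()
    looks_json = (
        probe[:1] in ("{", "[", '"')
        or probe in ("true", "false", "null")
        or _JSON_NUMBER_RE.match(probe) is not None
    )
    if looks_json:
        try:
            return json.loads(probe)
        except ValueError:  # json.JSONDecodeError subclasses ValueError
            pass
    # Python literals such as "True" / "None" and arbitrary prose stay raw.
    return raw


for sample in ["42", "True", '{"a": 1}', "hello world", "null", "{not json"]:
    print(repr(sample), "->", repr(coerce_arg_value(sample)))
```

As the commit message notes, a string argument whose text happens to be ``42`` or ``true`` is still decoded; without schema access the sketch cannot tell those cases apart either.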
…k-glm-kimi
Resolves three conflicts against the updated 5620 base (which itself
merged main after main moved on with PRs #5735 / #5775 / #5803 etc.
touching the same routes/inference.py and tool_call_parser.py surface):
* studio/backend/core/inference/tool_call_parser.py
  ``_TOOL_CLOSED_PATS``: kept 5624's full set (Mistral pre-v11 array,
  Mistral v11+ name{json}, DeepSeek envelope, Kimi section) on top of
  the 3-pattern base. New 5620 base reverted to the 3 base patterns
  because main never carried the tool-format extensions.
* studio/backend/routes/inference.py
  Merged the two regex bodies: kept 5624's elaborate python_tag
  ``(?:[^<]|<(?!\|))*`` clause and the new-family closed-pair patterns
  (DeepSeek envelope, Kimi section). DROPPED the inlined Mistral
  patterns in favour of the base's ``_strip_tool_xml`` helper which
  delegates Mistral handling to the parser module's
  ``_strip_mistral_closed_calls`` -- the non-greedy ``\{.*?\}`` form
  truncates at the first ``}`` of a nested JSON arg, so balanced
  brace/bracket scanning is correct here. Also kept the base's
  orphan-close and tail-only ``</parameter>`` patterns from the
  speculative-buffer split boundary work. Net: 9 call sites continue to
  use ``_strip_tool_xml(...)`` (Mistral-safe).
* studio/backend/tests/test_safetensors_tool_loop.py
  ``TestRoutesPythonTagStrip``: kept the base's wording for the section
  header (the two were near-identical) and switched the helper
  ``_strip`` back to ``_strip_tool_xml`` since the helper is restored.
Also retargeted ``test_pr5624_regressions.py``'s routes-layer strip
tests to ``_strip_tool_xml`` for consistency with the restored helper.
Tests:
pytest studio/backend/tests/test_safetensors_tool_loop.py
studio/backend/tests/test_safetensors_capability_advertise.py
studio/backend/tests/test_pr5624_regressions.py -q
-> 170 passed in 1.93s
pytest studio/backend/tests/ -q -k 'not gpu and not llama_cpp_integration'
-> 2034 passed, 15 failed (pre-existing on the 5620 base; same set as
before the merge: test_training_worker_flash_attn, test_desktop_auth,
test_studio_api integration shims).
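The reason for preferring a balanced scan over the non-greedy ``\{.*?\}`` form can be shown with a small sketch. ``balanced_json_end`` is a hypothetical helper written for this example (it tracks braces and string literals only); it is not the body of ``_strip_mistral_closed_calls``.

```python
import json
import re


def balanced_json_end(text: str, start: int) -> int:
    """Return the index just past the '}' closing the object at start, or -1."""
    depth = 0
    in_string = False
    escaped = False
    for i in range(start, len(text)):
        ch = text[i]
        if in_string:
            # Braces inside a JSON string must not affect the depth counter.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return i + 1
    return -1  # truncated or unbalanced


payload = 'get_weather{"loc": {"city": "Paris"}, "unit": "C"} trailing'
lazy = re.search(r"\{.*?\}", payload).group(0)
start = payload.index("{")
end = balanced_json_end(payload, start)

print(lazy)                # stops at the first '}' of the nested object
print(payload[start:end])  # the full argument object
```

The lazy match yields text that is not valid JSON, while the balanced slice round-trips through ``json.loads``.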
for more information, see https://pre-commit.ci
…k-glm-kimi
Second merge pass to absorb the base's "tighten verbose comments in
tool-call parser sections" commit plus pre-commit auto-fixes that
landed on both sides since the previous merge.
Conflicts:
* studio/backend/core/inference/tool_call_parser.py
- Regex prelude: kept 5624's per-family blocks (DeepSeek, GLM,
Kimi) and trimmed comments to the base's tightened style for
Gemma 4. New families need the explicit constants regardless of
comment density.
- parse_tool_calls_from_text: dropped the base's compact for-loop
over the 5 original parsers (it would have double-run Qwen and
function_xml ahead of the explicit chain). Kept 5624's interleaved
dispatch (DeepSeek/Kimi first, Qwen, GLM-after-Qwen,
function_xml, python_tag, Mistral, Gemma, bare-JSON fallback),
with tightened single-line per-family comments.
* studio/backend/routes/inference.py
- _PARSER_MARKERS comment: collapsed to base's tighter style while
keeping the per-family marker list so the next reader knows what
triggers the pill.
- _TOOL_XML_RE comment block: same compression, but the four
speculative-buffer leak shapes and the Mistral balanced-brace
delegation note are preserved.
- _TOOL_XML_RE pattern list: kept all six 5624 alternations
(python_tag elaborate, DeepSeek envelope, Kimi section, plus the
base's orphan-close and tail-only </parameter>) in the order that
matches the comment block.
Tests:
pytest studio/backend/tests/test_safetensors_tool_loop.py
studio/backend/tests/test_safetensors_capability_advertise.py
studio/backend/tests/test_pr5624_regressions.py -q
-> 170 passed in 2.22s
pytest studio/backend/tests/ -q -k 'not gpu and not llama_cpp_integration'
-> 2034 passed, 15 failed (same pre-existing CI gaps on the base:
test_training_worker_flash_attn, test_desktop_auth,
test_studio_api integration shims).
Two fixes surfaced by triple-confirm verification against the live
HF chat templates and upstream llama.cpp / vLLM / SGLang parsers.
1. GLM 4.7 silent drop
``zai-org/GLM-4.7/chat_template.jinja`` line 65 uses
``{{- '<tool_call>' + tc.name -}}`` which Jinja strips trailing
whitespace from, so the first ``<arg_key>`` follows the function
name with NO ``\n`` between them. Real emissions look like
``<tool_call>get_weather<arg_key>city</arg_key><arg_value>London
</arg_value></tool_call>``. The previous ``_GLM_TC_OPEN_RE`` ended
the name with ``\n`` so GLM-4.7 calls were silently dropped
(parser returned ``[]``).
Fix: relax the name terminator to a lookahead that accepts EITHER
``\n`` OR the next ``<arg_key>``:
_GLM_TC_OPEN_RE = re.compile(
r"<tool_call>\s*([^\n<{][^\n<]*?)\s*(?=\n|<arg_key>)"
)
The first-char restriction ``[^\n<{]`` still excludes Qwen's
``<tool_call>{json}`` form so the Qwen-vs-GLM dispatch remains
mutually exclusive.
2. Kimi multi-section parity with vLLM / SGLang
``vllm/tool_parsers/kimi_k2_tool_parser.py`` and SGLang's
``kimik2_detector.py`` both use ``re.findall`` and so collect every
``<|tool_calls_section_begin|>...<|tool_calls_section_end|>`` block
in a single stream. The previous implementation stopped at the
first ``<|tool_calls_section_end|>``. Kimi K2 doesn't emit
multi-section in practice, but parity is cheap.
Fix: wrap the existing per-call body parser in an outer loop that
advances past each ``<|tool_calls_section_end|>`` and continues to
the next ``<|tool_calls_section_begin|>``. Body parsing extracted
to ``_parse_kimi_section_body`` for clarity. Truncated final
section is still surfaced via the existing in-body balanced-brace
walk.
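The outer loop described in fix 2 might look roughly like this. ``iter_kimi_sections`` is a hypothetical name, and the per-call body parsing that ``_parse_kimi_section_body`` performs is deliberately left out of the sketch.

```python
SECTION_BEGIN = "<|tool_calls_section_begin|>"
SECTION_END = "<|tool_calls_section_end|>"


def iter_kimi_sections(text: str):
    """Yield the body of every section; a truncated final section is yielded too."""
    pos = 0
    while True:
        start = text.find(SECTION_BEGIN, pos)
        if start == -1:
            return
        body_start = start + len(SECTION_BEGIN)
        end = text.find(SECTION_END, body_start)
        if end == -1:
            # No closer: hand the remainder to the body parser as-is.
            yield text[body_start:]
            return
        yield text[body_start:end]
        # Advance past the closer and look for another opener.
        pos = end + len(SECTION_END)


stream = (
    SECTION_BEGIN + "call-a" + SECTION_END
    + " some text "
    + SECTION_BEGIN + "call-b" + SECTION_END
)
print(list(iter_kimi_sections(stream)))
```

Each yielded body would then go through the existing per-call parser, so call order across sections is preserved.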
Verified independently against the live HF templates:
* GLM-4.7 emission constructed from the live template parses to the
expected ``{name, arguments}`` shape.
* GLM-4.5 / 4.6 newline shape continues to parse (the lookahead also
matches ``\n``).
* Qwen ``<tool_call>{json}`` still dispatches to the Qwen path -- the
first-char restriction stops the GLM regex from biting JSON bodies.
* Kimi two-section stream surfaces both calls in order with full ids
preserved.
* Bare-counter Kimi ids still drop.
Tests added in ``test_pr5624_regressions.py``:
* ``test_glm_4_7_no_newlines_between_name_and_arg_key``
* ``test_glm_4_7_no_newlines_multi_call``
* ``test_glm_4_7_does_not_break_qwen_path``
* ``test_kimi_two_sections_in_one_stream_both_parse``
pytest studio/backend/tests/test_safetensors_tool_loop.py
studio/backend/tests/test_safetensors_capability_advertise.py
studio/backend/tests/test_pr5624_regressions.py -q
-> 174 passed in 1.93s
pytest studio/backend/tests/ -q -k 'not gpu and not llama_cpp_integration'
-> 2038 passed, 15 failed (pre-existing CI gaps).
Pure comment / docstring tightening on top of the GLM 4.7 + Kimi
multi-section fixes. No behavioural change.
* Drop multi-paragraph prelude and post-refactor citation chatter in
the DeepSeek, GLM and Kimi parser docstrings; keep the shape and
upstream-commit pin.
* Collapse ``parse_tool_calls_from_text``'s 9 per-family blocks into
a single ordered loop with one combined comment.
* Tighten the GLM coercion, Kimi bare-counter and ``_TOOL_XML_RE``
comments to one or two lines each.
* Same trim pass on ``_PARSER_MARKERS`` and the regression-test
docstrings.
Tests:
pytest studio/backend/tests/test_safetensors_tool_loop.py
studio/backend/tests/test_safetensors_capability_advertise.py
studio/backend/tests/test_pr5624_regressions.py -q
-> 174 passed in 2.00s
Adversarial input ``<|tool▁calls▁begin|><|tool▁call▁begin|>fn<|tool▁sep|>``
followed by a long body that does NOT contain a closing brace caused
the V3 path's ``([^\n<]+?)<|tool▁sep|>`` regex to backtrack
quadratically: at each position the lazy quantifier extends one char
at a time looking for a sep that isn't there, taking ~19s on 50k
chars.
Replace the regex search with ``str.find`` on the sep marker plus a
left-walk to recover the name. ``str.find`` is O(N); the walk stops
on ``\n`` (turn boundary), ``<`` (start of a tag), or ``>`` (end of
an optional ``<|tool▁call▁begin|>`` prefix). Same observable
behaviour as the regex on every canonical input.
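A minimal sketch of the ``str.find`` plus left-walk replacement, assuming the three stop characters listed above. ``find_name_before_sep`` is a hypothetical helper, not the function in the parser.

```python
SEP = "<|tool▁sep|>"


def find_name_before_sep(body: str, start: int = 0):
    """Return (name, index_after_sep), or None when no separator exists."""
    sep_at = body.find(SEP, start)  # single linear scan, no backtracking
    if sep_at == -1:
        return None
    i = sep_at
    # Walk left until a newline, the start of a tag, or the end of a prefix tag.
    while i > start and body[i - 1] not in "\n<>":
        i -= 1
    name = body[i:sep_at].strip()
    if not name:
        return None
    return name, sep_at + len(SEP)


body = "<|tool▁call▁begin|>get_weather<|tool▁sep|>" + '{"city": "Paris"}'
print(find_name_before_sep(body))
print(find_name_before_sep("x" * 50000))  # no separator: returns None quickly
```

The walk only runs once a separator has actually been found, so a long body with no separator costs one pass of ``str.find`` and nothing more.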
Tests:
test_deepseek_v3_1_huge_truncated_body_is_linear (new) -- 50k chars
must parse in < 1s.
pytest studio/backend/tests/test_safetensors_tool_loop.py
studio/backend/tests/test_safetensors_capability_advertise.py
studio/backend/tests/test_pr5624_regressions.py -q
-> 175 passed in 1.97s
pytest studio/backend/tests/ -q -k 'not gpu and not llama_cpp_integration'
-> 2038 passed, 15 pre-existing failures unchanged.
…to studio-tools-deepseek-glm-kimi
The JSON sub-path of ``_parse_llama3_python_tag`` was fabricating
``{"value": args}`` when the model emitted a non-dict / non-string
``arguments`` value (e.g. ``42``, ``[1,2,3]``, ``null``, ``true``).
This silently turned a malformed emission into a real tool call,
which the agentic loop would then execute with arguments the model
never intended.
Tightened: skip the call instead of fabricating. This now matches
the bare-JSON guard tightened earlier (strict-guard merge from
PR #5620, inherited via merge here).
Added a regression test covering the four non-dict / non-string shapes.
Pass count on this branch: 158 -> 159.
Sites in ``_parse_tool_call_json`` and ``_consume_mistral_call``
keep the existing looser behaviour for now; both are reached
only after explicit ``<tool_call>`` / ``[TOOL_CALLS]`` markers
so the false-positive surface there is much narrower.
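The skip-instead-of-fabricate rule could be sketched like this. ``normalise_arguments`` is a hypothetical helper, and the handling of string ``arguments`` (decoding them as JSON) is an assumption of the sketch rather than something stated above.

```python
import json


def normalise_arguments(args):
    """Return a dict of arguments, or None when the call should be skipped."""
    if isinstance(args, dict):
        return args
    if isinstance(args, str):
        # Assumption: a string is treated as JSON-encoded arguments.
        try:
            decoded = json.loads(args)
        except ValueError:
            return None
        return decoded if isinstance(decoded, dict) else None
    # 42, [1, 2, 3], None, True, ...: never wrap these as {"value": args}.
    return None


for bad in (42, [1, 2, 3], None, True):
    print(repr(bad), "->", normalise_arguments(bad))
print(normalise_arguments({"query": "pytorch"}))
```

A caller would drop any call for which this returns ``None`` rather than executing it with invented arguments.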
…a.cpp
Four GGUF-parity fixes for the GLM and Kimi K2 families:
- GLM 4.7 zero-argument inline call <tool_call>name</tool_call> was
  dropped: the open-tag lookahead only allowed \n or <arg_key> after the
  name. Allow </tool_call> too so a no-arg call parses to empty args
  (vLLM / SGLang / llama.cpp all parse it).
- GLM string argument values were stripped, losing significant leading /
  trailing whitespace in code / diff arguments. Keep the raw value for
  the string fallback and only strip the copy used to probe for a JSON
  literal, matching vLLM glm4_moe which never strips string args.
- Kimi K2 calls emitted without the <|tool_calls_section_begin|> wrapper
  were dropped. llama.cpp makes the section optional (Kimi can call a
  tool straight after reasoning without opening a section); parse a bare
  <|tool_call_begin|> when no section is present.
- Kimi K2 malformed / truncated JSON in one call dropped every later
  call in the section. Skip the bad call and keep parsing so valid
  subsequent calls are recovered (vLLM parity).
Adds regression tests for all four.
Validated this end to end on 4 B200 GPUs, one Studio per model family, running each safetensors model and its Results, safetensors vs GGUF, tools fired out of 18:
So on the families whose chat template actually advertises tools, the safetensors path now matches the GGUF. Where safetensors does not fire, the GGUF does not fire either, so there is no remaining backend gap. I also cross-checked each family's parsing against llama.cpp
All of these pass on the safetensors tool-loop suite. I confirmed the same behaviour in the Studio UI. On Qwen3-14B the "Search the web for the latest stable PyTorch release, then compute the sum of primes below 1000" prompt shows a "2 tool calls" badge, real web_search source chips, and the correct python result of 76,127. On Llama-3.1-8B the weather plus Fibonacci prompt returns real web_search source chips and runs the python tool. One follow-up worth noting separately, unrelated to this parser: for Mistral the Studio swaps in the Unsloth
…ripped parser)
Gemma-4 safetensors fired no tools while its GGUF fired reliably. Three gaps:
- The Studio swaps in the Unsloth "gemma-4" chat template, which does not
render the tools schema (the model's native template does), so the model
never saw the tools. Fall back to the model's native template when the
override template renders identically with and without tools. Same fix
helps any family whose override template drops tools.
- skip_special_tokens strips the <|tool_call> wrapper and <|"|> string
markers, so a streamed Gemma-4 call arrives as a bare call:NAME{k:v, ...}
with unquoted values. Parse that form, keeping commas/braces inside a
code or command value, normalising surrounding quotes, and stripping the
leaked markup from the final answer.
- Without a grammar a small model can loop, repeating one call for the whole
tool budget. Collapse exact-duplicate calls within a turn and force a final
answer after a turn that made no new tool progress (llama-server's lazy
grammar prevents this loop on the GGUF side).
Adds parser tests for the bare/stripped Gemma-4 form.
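The duplicate-collapse step in the third bullet can be illustrated with a short sketch. ``collapse_duplicate_calls`` and the ``{"name", "arguments"}`` dict shape are assumptions for this example; the forced final answer after a no-progress turn is not shown.

```python
import json


def collapse_duplicate_calls(calls):
    """Keep the first occurrence of each (name, arguments) pair, in order."""
    seen = set()
    unique = []
    for call in calls:
        # sort_keys makes the key independent of argument ordering.
        key = (call["name"], json.dumps(call.get("arguments", {}), sort_keys=True))
        if key in seen:
            continue
        seen.add(key)
        unique.append(call)
    return unique


turn = [
    {"name": "web_search", "arguments": {"query": "weather London"}},
    {"name": "web_search", "arguments": {"query": "weather London"}},
    {"name": "python", "arguments": {"code": "print(1)"}},
    {"name": "web_search", "arguments": {"query": "weather London"}},
]
print(collapse_duplicate_calls(turn))
```

Only exact duplicates are collapsed, so a repeated tool with different arguments still runs.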
Added gemma-4-E4B-it to the comparison, and this one was a real case of the GGUF being better: safetensors fired 0/18 while the GGUF fired 18/18. Three gaps, all fixed in 4127d1f:
After the fix, safetensors vs GGUF over 18 runs each (same prompts, same generation params):
Tool firing now matches the GGUF in both rate and the count and kind of tools called. Two gaps remain and are structural to running a small model without llama.cpp's grammar: the E4B model relays computed values into its final answer less reliably (0/6 vs 4/6 on the deterministic code checks) and takes more turns, so it is slower. In the Studio UI the model fires web_search against real weather sources and runs python, returning "The 30th Fibonacci number is: 832040". Parser tests for the bare and stripped Gemma-4 form are included. The safetensors tool-loop suite is at 157 passing.
One more validation, this time on a new MoE: Qwen3.6-35B-A3B safetensors vs its MTP-GGUF ( Architecture is safetensors vs MTP-GGUF, 18 runs each, same prompts and params:
Both fire on every question, so there is no safetensors gap here. If anything the safetensors path is the more accurate one, passing all 6 deterministic code checks vs 3/6 for the GGUF, and it calls fewer tools (the MTP-GGUF over-searches, averaging about 11 web_search calls on the GitHub-issues question). Confirmed in the Studio UI: the MTP-GGUF fires web_search against real PyTorch sources and runs python, returning the correct sum of primes below 1000 (76,127).
DRAFT - do not merge. Stacked on PR #5620 (Llama-3 / Mistral / Gemma 4 + healing parity). Validate against real models before merging.