UPSTREAM PR #19635: common : fix Step-3.5-Flash format detection and thinking support by loci-dev · Pull Request #1182 · auroralabs-loci/llama.cpp

loci-dev · 2026-02-16T03:08:53Z

Note

Source pull request: ggml-org/llama.cpp#19635

Summary

Step-3.5-Flash (196B MoE) uses the same XML-style tool call format as Qwen3-Coder (<tool_call><function=...><parameter=...>) but its Jinja template lacks the bare <function> and plural <parameters> markers that the detection logic previously required. This caused it to fall through to Hermes 2 Pro, which doesn't call func_args_not_string(), so arguments stayed as JSON strings and templates using arguments|items crashed.

Reported by multiple users in #19283.

Additionally, the Qwen3-Coder-XML format handler had no thinking support. Models like Step-3.5-Flash that unconditionally emit <think> in their generation prompt need the same thinking_forced_open handling that Nemotron v3 and Hermes 2 Pro already have, otherwise reasoning_content is never separated from content in API responses.
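The effect of thinking_forced_open can be illustrated with a minimal sketch. This is not the llama.cpp implementation: `split_result` and `split_reasoning` are hypothetical names invented for this example, and the sketch does no whitespace trimming or streaming. It only shows why a handler must know that the generation prompt already ended with <think> in order to separate reasoning_content from content.

```cpp
#include <string>

// Hypothetical result type for this sketch (not a llama.cpp type).
struct split_result {
    std::string reasoning_content;
    std::string content;
};

// Hypothetical helper. When thinking_forced_open is true, the generation
// prompt already ended with "<think>", so the model output starts inside
// the reasoning block and contains no opening tag of its own.
inline split_result split_reasoning(const std::string & output, bool thinking_forced_open) {
    static const std::string open_tag  = "<think>";
    static const std::string close_tag = "</think>";

    std::string text        = output;
    bool        in_thinking = thinking_forced_open;

    // If the block was not forced open, the model may still open it itself.
    if (!in_thinking && text.compare(0, open_tag.size(), open_tag) == 0) {
        text.erase(0, open_tag.size());
        in_thinking = true;
    }
    if (!in_thinking) {
        return { "", text };
    }

    const std::string::size_type end = text.find(close_tag);
    if (end == std::string::npos) {
        // The reasoning block was never closed: treat everything as reasoning.
        return { text, "" };
    }
    return { text.substr(0, end), text.substr(end + close_tag.size()) };
}
```

Without the flag, an output such as `plan</think>answer` has no opening tag to anchor on, so a handler that ignores the forced-open state would return the whole string as content.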

Changes

- Relax Qwen3-Coder XML detection to only require the 3 shared markers (<tool_call>, <function=, <parameter=)
- Tighten Nemotron v3 branch to also require bare <function> and plural <parameters>, preventing Step-3.5-Flash from being misrouted via <think>
- Add thinking_forced_open support to Qwen3-Coder-XML init function (same pattern as Nemotron v3, Hermes 2 Pro, Granite)
- Add <think>/</think> to preserved tokens
- Fix build_grammar_xml_tool_call to handle thinking_forced_open in the grammar root rule, allowing </think> before tool calls when tool_choice=required (also fixes pre-existing bug in MiniMax-M2 and GLM 4.5)
- Add unmodified HuggingFace chat template and format detection test
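The first two changes amount to reordering substring checks on the template source. A rough sketch of that detection order follows. The enum and function names are hypothetical, the real code uses COMMON_CHAT_FORMAT_* values, and the assumption that the Nemotron v3 branch also keys on <think> is inferred from the description above rather than taken from the diff.

```cpp
#include <string>

// Hypothetical enum for this sketch; llama.cpp uses COMMON_CHAT_FORMAT_* values.
enum class chat_format { nemotron_v3, qwen3_coder_xml, other };

// True if the template source contains the given marker.
static bool has(const std::string & src, const char * marker) {
    return src.find(marker) != std::string::npos;
}

// Hypothetical detection routine mirroring the order described in the PR.
inline chat_format detect_format(const std::string & tmpl) {
    // The 3 markers shared by Qwen3-Coder, Nemotron v3 and Step-3.5-Flash.
    const bool shared = has(tmpl, "<tool_call>") &&
                        has(tmpl, "<function=")  &&
                        has(tmpl, "<parameter=");
    if (!shared) {
        return chat_format::other; // e.g. falls through to Hermes 2 Pro and later checks
    }
    // Tightened branch (assumption: <think> plus the two extra markers).
    if (has(tmpl, "<think>") && has(tmpl, "<function>") && has(tmpl, "<parameters>")) {
        return chat_format::nemotron_v3;
    }
    // Relaxed branch: the shared markers alone are enough.
    return chat_format::qwen3_coder_xml;
}
```

Under this ordering a template that has <think> and the shared markers, but lacks bare <function> and plural <parameters>, lands in the Qwen3-Coder XML branch instead of being misrouted.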

Testing

- test-chat passes (format detected as COMMON_CHAT_FORMAT_QWEN3_CODER_XML)
- Validated tool calling via /v1/chat/completions: correct tool_calls with parsed arguments
- Validated thinking via /v1/chat/completions: reasoning_content properly separated from content
- Tested on Step-3.5-Flash IQ3_XXS with --reasoning-format auto (default)

AI Disclosure

Claude was used for codebase exploration, pattern identification, and drafting. All changes follow established patterns from existing format handlers (Nemotron v3, Hermes 2 Pro, MiniMax-M2). Fully tested locally.

Step-3.5-Flash uses the same XML-style tool call format as Qwen3-Coder (<tool_call><function=...><parameter=...>) but its Jinja template lacks the bare <function> and plural <parameters> markers that the detection logic previously required. This caused it to fall through to Hermes 2 Pro, which doesn't call func_args_not_string(), so arguments stayed as JSON strings and templates using arguments|items crashed.

Additionally, the Qwen3-Coder-XML format handler had no thinking support. Models like Step-3.5-Flash that unconditionally emit <think> in their generation prompt need the same thinking_forced_open handling that Nemotron v3 and Hermes 2 Pro already have, otherwise reasoning_content is never separated from content in API responses.

Changes:
- Relax Qwen3-Coder XML detection to only require the 3 shared markers
- Tighten Nemotron v3 branch to also require bare <function> and plural <parameters>, preventing Step-3.5-Flash from being misrouted via <think>
- Add thinking_forced_open support to Qwen3-Coder-XML init function
- Add <think>/</think> to preserved tokens
- Fix build_grammar_xml_tool_call to handle thinking_forced_open in the grammar root rule, allowing </think> before tool calls
- Add Step-3.5-Flash chat template and format detection test

Builds on: ggml-org/llama.cpp#19283

Step-3.5-Flash uses the same XML tool call format as Qwen3-Coder and Nemotron 3 Nano (<tool_call>/<function=...>/<parameter=...>) but with unconditional <think> output. Route it to the Nemotron v3 PEG parser for streaming and schema-aware parameter parsing. Detection: templates with <think> + XML tool tags use Nemotron v3 PEG parser; templates without <think> (Qwen3-Coder) use GBNF grammar. Tests cover: basic messages, tool calls with/without thinking content, parallel tool calls, code string parameters, optional </parameter> closing tags, and JSON schema response format.
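The revised routing in this commit can be sketched as a single check on <think>. The names below are hypothetical and the sketch only models the decision described in the commit message, not the parsers themselves.

```cpp
#include <string>

// Hypothetical labels for the two code paths named in the commit message.
enum class xml_tool_parser { nemotron_v3_peg, qwen3_coder_gbnf, not_xml };

// Hypothetical router: XML tool tags plus <think> select the PEG parser,
// XML tool tags without <think> select the GBNF grammar path.
inline xml_tool_parser route_xml_tool_template(const std::string & tmpl) {
    const auto has = [&tmpl](const char * marker) {
        return tmpl.find(marker) != std::string::npos;
    };
    if (!(has("<tool_call>") && has("<function=") && has("<parameter="))) {
        return xml_tool_parser::not_xml;
    }
    return has("<think>") ? xml_tool_parser::nemotron_v3_peg
                          : xml_tool_parser::qwen3_coder_gbnf;
}
```

Compared with the earlier approach in this PR, the bare <function> and plural <parameters> markers no longer take part in the decision, which is why the thinking code added to the Qwen3-Coder path became unreachable.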

Remove thinking handling code that became unreachable after routing Step-3.5-Flash to the Nemotron v3 PEG parser. Qwen3-Coder has no <think> in its template, so the thinking_forced_open logic, preserved tokens, and grammar prefix were dead paths.

loci-dev had a problem deploying to PROD__AL_DEMO February 16, 2026 03:08 — with GitHub Actions Failure

jesseposner added 3 commits February 15, 2026 23:11

loci-dev force-pushed the main branch 4 times, most recently from 073bd79 to 823244c, on February 18, 2026 02:17

loci-dev force-pushed the loci/pr-19635-fix-step35-tool-call-detection branch from e26fa44 to bdc1dda on February 18, 2026 02:17

loci-dev had a problem deploying to PROD__AL_DEMO February 18, 2026 02:17 — with GitHub Actions Error

loci-dev force-pushed the main branch 10 times, most recently from 2cecc98 to a92fe2a, on February 26, 2026 02:16

loci-dev force-pushed the main branch 9 times, most recently from 9f4f332 to 4298c74, on March 6, 2026 02:17

loci-dev force-pushed the main branch from 4298c74 to 0db6c47 on March 7, 2026 02:16

loci-dev force-pushed the main branch 8 times, most recently from 56aaa36 to 21147c2, on March 13, 2026 02:17

loci-dev wants to merge 3 commits into main from loci/pr-19635-fix-step35-tool-call-detection
