UPSTREAM PR #19635: common : fix Step-3.5-Flash format detection and thinking support #1182
Open
Step-3.5-Flash uses the same XML-style tool call format as Qwen3-Coder (<tool_call><function=...><parameter=...>) but its Jinja template lacks the bare <function> and plural <parameters> markers that the detection logic previously required. This caused it to fall through to Hermes 2 Pro, which doesn't call func_args_not_string(), so arguments stayed as JSON strings and templates using arguments|items crashed.

Additionally, the Qwen3-Coder-XML format handler had no thinking support. Models like Step-3.5-Flash that unconditionally emit <think> in their generation prompt need the same thinking_forced_open handling that Nemotron v3 and Hermes 2 Pro already have, otherwise reasoning_content is never separated from content in API responses.

Changes:
- Relax Qwen3-Coder XML detection to only require the 3 shared markers
- Tighten Nemotron v3 branch to also require bare <function> and plural <parameters>, preventing Step-3.5-Flash from being misrouted via <think>
- Add thinking_forced_open support to Qwen3-Coder-XML init function
- Add <think>/</think> to preserved tokens
- Fix build_grammar_xml_tool_call to handle thinking_forced_open in the grammar root rule, allowing </think> before tool calls
- Add Step-3.5-Flash chat template and format detection test

Builds on: ggml-org/llama.cpp#19283
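To illustrate why thinking_forced_open matters for separating reasoning_content from content, here is a minimal, self-contained sketch. It is not the llama.cpp implementation: the SplitMessage struct and split_reasoning function are hypothetical names invented for this example, and it assumes the generation prompt ends with <think> so that the model output starts inside the reasoning block without an opening tag.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type for this sketch (not llama.cpp's message struct).
struct SplitMessage {
    std::string reasoning_content;
    std::string content;
};

// Splits raw model output into reasoning and visible content.
// thinking_forced_open == true models a template whose generation prompt
// already ends with "<think>", so the output begins mid-reasoning.
SplitMessage split_reasoning(const std::string & output, bool thinking_forced_open) {
    static const std::string open_tag  = "<think>";
    static const std::string close_tag = "</think>";

    SplitMessage msg;
    std::size_t start = 0;
    bool in_think = thinking_forced_open;

    // Without the forced-open flag, reasoning is only recognized when the
    // output itself starts with the opening tag.
    if (!in_think && output.compare(0, open_tag.size(), open_tag) == 0) {
        in_think = true;
        start = open_tag.size();
    }
    if (!in_think) {
        msg.content = output;  // everything is treated as visible content
        return msg;
    }

    const std::size_t end = output.find(close_tag, start);
    if (end == std::string::npos) {
        // Reasoning block never closed (e.g. truncated or mid-stream output).
        msg.reasoning_content = output.substr(start);
        return msg;
    }
    msg.reasoning_content = output.substr(start, end - start);
    msg.content = output.substr(end + close_tag.size());
    return msg;
}
```

With the flag off, an output such as "plan</think>answer" ends up entirely in content, which mirrors the symptom described above; with the flag on, "plan" goes to reasoning_content and "answer" to content.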
Step-3.5-Flash uses the same XML tool call format as Qwen3-Coder and Nemotron 3 Nano (<tool_call>/<function=...>/<parameter=...>) but with unconditional <think> output. Route it to the Nemotron v3 PEG parser for streaming and schema-aware parameter parsing. Detection: templates with <think> + XML tool tags use Nemotron v3 PEG parser; templates without <think> (Qwen3-Coder) use GBNF grammar. Tests cover: basic messages, tool calls with/without thinking content, parallel tool calls, code string parameters, optional </parameter> closing tags, and JSON schema response format.
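The detection rule in this commit message can be sketched roughly as follows. This is an illustration only: the ToolParser enum and pick_parser function are hypothetical names, and the real code in llama.cpp inspects the template source with its own helpers and format identifiers. The sketch assumes detection is a plain substring check on the Jinja template text.

```cpp
#include <string>

// Hypothetical parser choices for this sketch; the actual format
// identifiers are defined in llama.cpp's common chat code.
enum class ToolParser { NemotronV3Peg, Qwen3CoderGbnf, Other };

static bool contains(const std::string & haystack, const std::string & needle) {
    return haystack.find(needle) != std::string::npos;
}

// Picks a parser from the raw template source.
ToolParser pick_parser(const std::string & tmpl) {
    // The three markers shared by Qwen3-Coder, Nemotron 3 Nano and Step-3.5-Flash.
    const bool xml_tool_tags = contains(tmpl, "<tool_call>")
                            && contains(tmpl, "<function=")
                            && contains(tmpl, "<parameter=");
    if (!xml_tool_tags) {
        return ToolParser::Other;
    }
    // Templates that also mention <think> go to the PEG parser;
    // templates without it (Qwen3-Coder) keep the GBNF grammar path.
    return contains(tmpl, "<think>") ? ToolParser::NemotronV3Peg
                                     : ToolParser::Qwen3CoderGbnf;
}
```

Checking the shared tool tags first keeps a template that only mentions <think> from being routed to an XML tool call parser at all.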
Remove thinking handling code that became unreachable after routing Step-3.5-Flash to the Nemotron v3 PEG parser. Qwen3-Coder has no <think> in its template, so the thinking_forced_open logic, preserved tokens, and grammar prefix were dead paths.
Note
Source pull request: ggml-org/llama.cpp#19635
Summary
Step-3.5-Flash (196B MoE) uses the same XML-style tool call format as Qwen3-Coder (<tool_call><function=...><parameter=...>) but its Jinja template lacks the bare <function> and plural <parameters> markers that the detection logic previously required. This caused it to fall through to Hermes 2 Pro, which doesn't call func_args_not_string(), so arguments stayed as JSON strings and templates using arguments|items crashed.

Reported by multiple users in #19283:
- arguments|items crash with opencode (@jacekpoplawski)

Additionally, the Qwen3-Coder-XML format handler had no thinking support. Models like Step-3.5-Flash that unconditionally emit <think> in their generation prompt need the same thinking_forced_open handling that Nemotron v3 and Hermes 2 Pro already have, otherwise reasoning_content is never separated from content in API responses.

Changes
- Relax Qwen3-Coder XML detection to only require the 3 shared markers (<tool_call>, <function=, <parameter=)
- Tighten Nemotron v3 branch to also require bare <function> and plural <parameters>, preventing Step-3.5-Flash from being misrouted via <think>
- Add thinking_forced_open support to Qwen3-Coder-XML init function (same pattern as Nemotron v3, Hermes 2 Pro, Granite)
- Add <think>/</think> to preserved tokens
- Fix build_grammar_xml_tool_call to handle thinking_forced_open in the grammar root rule, allowing </think> before tool calls when tool_choice=required (also fixes pre-existing bug in MiniMax-M2 and GLM 4.5)

Testing
- test-chat passes (format detected as COMMON_CHAT_FORMAT_QWEN3_CODER_XML)
- /v1/chat/completions: correct tool_calls with parsed arguments
- /v1/chat/completions: reasoning_content properly separated from content
- --reasoning-format auto (default)

Related
AI Disclosure
Claude was used for codebase exploration, pattern identification, and drafting. All changes follow established patterns from existing format handlers (Nemotron v3, Hermes 2 Pro, MiniMax-M2). Fully tested locally.