reasoning: fix tool-call routing bug in GLM-5.1-FP8 at long context (AIVLLM-229)#40782
reasoning: fix tool-call routing bug in GLM-5.1-FP8 at long context (AIVLLM-229)#40782jatseng-ai wants to merge 1 commit into
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a three-phase fix for the GLM-4.5 model to handle tool-call XML blocks that are incorrectly emitted inside reasoning tags. It includes a new _split_embedded_tool_calls method to promote these blocks to the content field and comprehensive unit tests for various edge cases. Feedback highlights a regression where essential properties were removed from the parser and a potential issue with a regular expression being too greedy, which could lead to the accidental removal of reasoning prose.
| @@ -40,14 +76,6 @@ def __init__(self, tokenizer: PreTrainedTokenizerBase, *args, **kwargs): | |||
| else: | |||
| self._parser = IdentityReasoningParser(tokenizer, *args, **kwargs) | |||
|
|
|||
| @property | |||
| def reasoning_start_str(self) -> str | None: | |||
| return self._parser.reasoning_start_str | |||
|
|
|||
| @property | |||
| def reasoning_end_str(self) -> str | None: | |||
| return self._parser.reasoning_end_str | |||
There was a problem hiding this comment.
The removal of the reasoning_start_str and reasoning_end_str properties from DeepSeekV3ReasoningParser is a regression. These properties are essential for delegating to the underlying parser (e.g., DeepSeekR1ReasoningParser), which defines the actual delimiters like <think> and </think>. Without these, the parser will return None for these delimiters, which can break reasoning block detection in the engine or API layers that rely on them.
| _BODY_WITHOUT_OPEN_RE = re.compile( | ||
| r"^(?:[\w.-]+\n?)?" # optional function name (no spaces, optional newline) | ||
| r"(?:<arg_key>.*?</arg_key>" # at least one complete key-value pair | ||
| r"<arg_value>.*?</arg_value>)*" | ||
| r"<arg_key>.*?</arg_key>" # required: at least one <arg_key> must be present | ||
| r"<arg_value>.*?" # value (may be truncated) | ||
| r"(?:</arg_value>)?" | ||
| r"</tool_call>" # required closing tag | ||
| r".*$", | ||
| re.DOTALL, | ||
| ) |
There was a problem hiding this comment.
The regex _BODY_WITHOUT_OPEN_RE uses .*$ at the end, which greedily captures any text following the </tool_call> tag. If the model produces additional reasoning prose after the tool call body within the same block, that prose will be incorrectly moved to the content field and removed from reasoning_content. Changing this to \s*$ ensures that Phase 2 only triggers when the entire remaining string is a tool call body (optionally followed by whitespace), preventing accidental loss of reasoning prose.
| _BODY_WITHOUT_OPEN_RE = re.compile( | |
| r"^(?:[\w.-]+\n?)?" # optional function name (no spaces, optional newline) | |
| r"(?:<arg_key>.*?</arg_key>" # at least one complete key-value pair | |
| r"<arg_value>.*?</arg_value>)*" | |
| r"<arg_key>.*?</arg_key>" # required: at least one <arg_key> must be present | |
| r"<arg_value>.*?" # value (may be truncated) | |
| r"(?:</arg_value>)?" | |
| r"</tool_call>" # required closing tag | |
| r".*$", | |
| re.DOTALL, | |
| ) | |
| _BODY_WITHOUT_OPEN_RE = re.compile( | |
| r"^(?:[\w.-]+\n?)?" # optional function name (no spaces, optional newline) | |
| r"(?:<arg_key>.*?</arg_key>" | |
| r"<arg_value>.*?</arg_value>)*" | |
| r"<arg_key>.*?</arg_key>" # required: at least one <arg_key> must be present | |
| r"<arg_value>.*?" # value (may be truncated) | |
| r"(?:</arg_value>)?" | |
| r"</tool_call>" | |
| r"\s*$", | |
| re.DOTALL, | |
| ) |
Summary
GLM-5.1-FP8 places
<tool_call>XML inside<think>blocks when prompt context exceeds ~8k tokens. vLLM's glm45 reasoning parser captures this inreasoning_contentinstead of routing it totool_calls, producingfinish_reason=stopinstead offinish_reason=tool_calls.Three-phase fix in
DeepSeekV3ReasoningWithThinkingParser:<tool_call>…</tool_call>blocks inside reasoning promoted verbatim<tool_call>19/19 unit tests pass. Live test enforces 8k-token gate. Confirmed PASS at 10,431 prompt tokens.
Analogous to #39055 (Qwen3) but requires additional phases because GLM's
<tool_call>open token is a special token stripped from text output.Ref: AIVLLM-229