[Bugfix] Fix GLM4 MoE and SeedOSS reasoning parser regressions#37044
he-yufeng wants to merge 2 commits into vllm-project:main
Conversation
Code Review
This pull request introduces two fixes for reasoning parser regressions in GLM4 MoE and SeedOSS models. The changes for SeedOSS correctly align streaming and non-streaming behavior. For GLM4 MoE, a new parser is introduced that handles non-streaming output without a closing </think> tag. However, I've identified an inconsistency between the streaming and non-streaming behavior in the new GLM4 parser. Specifically, output with an opening <think> tag but no closing tag is treated as reasoning in streaming mode but as content in non-streaming mode. This should be addressed to ensure consistent parsing logic across both modes.
```python
class Glm4MoeReasoningParser(BaseThinkingReasoningParser):
    """
    Reasoning parser for GLM-4 MoE models.

    Unlike DeepSeek R1, GLM-4 injects <think> via the chat template rather
    than generating it. When the model output lacks </think>, the entire
    output is treated as *content* (not reasoning), because the absence of
    the end tag means the model chose not to reason.
    """
```
There's an inconsistency between the non-streaming and streaming behavior of this parser for outputs that contain `<think>` but not `</think>`.

- The `extract_reasoning` method correctly implements the logic described in the docstring: if `</think>` is absent, the entire output is treated as content. For an input like `"<think>some reasoning"`, it will return `(None, "<think>some reasoning")`.
- However, this class inherits `extract_reasoning_streaming` from `BaseThinkingReasoningParser`. The base implementation will treat `"<think>some reasoning"` as reasoning during streaming, which contradicts this parser's stated logic for handling outputs without a closing `</think>` tag.

This is the same type of inconsistency that this PR fixes for `SeedOSSReasoningParser`. To ensure consistent behavior, `Glm4MoeReasoningParser` should also override `extract_reasoning_streaming`. A potential approach is to buffer content after `<think>` and only flush it as reasoning once `</think>` is seen. If the stream ends before `</think>`, the buffer would be flushed as content.
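The buffering approach suggested above could look roughly like the standalone sketch below. This is illustrative only and independent of vLLM's actual parser classes (the class name `BufferedThinkStream` and its methods are invented for the example): text after `<think>` is held in a buffer, flushed as reasoning once `</think>` arrives, and flushed as content if the stream ends without the closing tag.

```python
# Illustrative sketch of the buffering idea, not vLLM's actual parser code.
class BufferedThinkStream:
    def __init__(self, start_tag: str = "<think>", end_tag: str = "</think>"):
        self.start_tag = start_tag
        self.end_tag = end_tag
        self.buffer = ""
        self.closed = False

    def feed(self, delta: str) -> tuple[str, str]:
        """Return (reasoning_delta, content_delta) for this chunk."""
        if self.closed:
            # Everything after </think> is plain content.
            return "", delta
        self.buffer += delta
        if self.end_tag in self.buffer:
            reasoning, _, content = self.buffer.partition(self.end_tag)
            reasoning = reasoning.removeprefix(self.start_tag)
            self.buffer = ""
            self.closed = True
            return reasoning, content
        # Keep buffering until we know which side of the tag we are on.
        return "", ""

    def finish(self) -> tuple[str, str]:
        """Stream ended: an unclosed buffer is flushed as content."""
        if self.closed or not self.buffer:
            return "", ""
        content, self.buffer = self.buffer, ""
        return "", content
```

The obvious trade-off of this design is latency: no reasoning tokens are emitted to the client until `</think>` is seen, which is the price of deferring the reasoning-vs-content decision.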
alvinttang
left a comment
Review
This PR fixes two distinct regressions — the GLM-4 MoE reasoning parser and the SeedOSS streaming parser. Both fixes are well-scoped.
GLM-4 MoE reasoning parser
The old mapping pointed "glm45" at DeepSeekV3ReasoningWithThinkingParser, which is incorrect for GLM-4 MoE. The new Glm4MoeReasoningParser correctly inherits from BaseThinkingReasoningParser and overrides the key semantic difference: when </think> is absent, the entire output is treated as content rather than reasoning. This matches GLM-4's behavior where <think> is injected via the chat template, and the model can opt out of reasoning by not emitting the end tag.
The `extract_reasoning` implementation is clean. One edge case to consider: if the model outputs `<think></think>` (empty reasoning), reasoning will be `""` and content will be `None` (since `content or None` converts the empty string to `None`). Is empty-string reasoning semantically meaningful here, or should it also be normalized to `None`?
SeedOSS streaming parser
The `extract_reasoning_streaming` override handles the case where the start token is in the chat template (not generated). When neither `previous_token_ids` nor `delta_token_ids` contain the start token ID, the logic is:

- If the end token is in the delta → split the delta into reasoning + content
- If the end token is in the previous tokens → all of the delta is content
- Otherwise → all of the delta is reasoning
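The three branches above can be sketched over plain token-ID lists (a simplified stand-in for the parser's real signature; the function name and the placeholder end-token ID are invented for illustration):

```python
END = 2  # placeholder end-token ID, for illustration only

def classify_delta(previous_ids: list[int],
                   delta_ids: list[int]) -> tuple[list[int], list[int]]:
    """Classify a streamed delta into (reasoning_ids, content_ids) when
    the start token was never generated (it lives in the chat template)."""
    if END in delta_ids:
        # End token arrives in this chunk: split around it.
        i = delta_ids.index(END)
        return delta_ids[:i], delta_ids[i + 1:]
    if END in previous_ids:
        # Reasoning already finished in an earlier chunk: all content.
        return [], delta_ids
    # End token not seen yet: the delta is still reasoning.
    return delta_ids, []
```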
This is correct, but I have a concern about the token ID checks: `self.start_token_id not in previous_token_ids` does a linear scan of the full token history on every streaming chunk. For long generations, this could become expensive. Consider tracking whether the start token was seen via a boolean flag on the parser instance rather than re-scanning the history each time.
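The flag idea could look roughly like this (a hedged sketch; in the real parser the state would live on the parser instance, and `StartTokenTracker` is a name invented for the example):

```python
class StartTokenTracker:
    """Remember whether a given token ID has appeared, so each streaming
    chunk only checks the new delta instead of rescanning the history."""

    def __init__(self, start_token_id: int):
        self.start_token_id = start_token_id
        self.seen = False

    def update(self, delta_ids: list[int]) -> bool:
        # O(len(delta)) per chunk instead of O(total history).
        if not self.seen and self.start_token_id in delta_ids:
            self.seen = True
        return self.seen
```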
Also, regarding the `end_index = delta_text.find(self.end_token)` call in the end-token-in-delta case: if the end token is split across two delta chunks (rare, but possible with some tokenizers), `find` will fail and the text won't be split correctly. The base class may already handle this via its partial-match machinery, so this might be fine, but it is worth verifying.
Missing tests
Neither parser has unit tests in this PR. Given that these are regression fixes, adding at least a couple of test cases for each (especially the "no end tag" and "streaming split" edge cases) would help prevent future regressions.
Overall, solid fixes. The GLM-4 parser is particularly clean.
Thanks for the thorough review @alvinttang!

- Empty reasoning: Good catch.
- Linear scan concern: Valid point. However, this is inherited from …
- Tests: The test cases already exist in …
GLM4 MoE (vllm-project#37023): PR vllm-project#33221 replaced the dedicated `Glm4MoeModelReasoningParser` with the generic `DeepSeekV3ReasoningWithThinkingParser`, which delegates to R1. R1 treats tagless text as reasoning, but GLM4 injects `<think>` via the chat template, so tagless output means the model chose not to reason. Add `Glm4MoeReasoningParser` that returns `(None, content)` when `</think>` is absent, matching the expected semantics. Empty reasoning (`<think></think>`) is normalized to `None`.

SeedOSS (vllm-project#37022): The streaming path in `BaseThinkingReasoningParser` returns content for tagless text, but SeedOSS (like R1) may not emit the start token. Add the same R1-style streaming override so tagless streaming text is correctly classified as reasoning.

Fixes vllm-project#37023, fixes vllm-project#37022

Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com>
Force-pushed from 10ae5cf to 7ae12d6
<think></think> means the model chose not to reason, so empty string reasoning should be None, not "". Added test cases for this edge case.
Thanks again for the detailed review @alvinttang! Pushed a fix (158912e):

- Empty reasoning normalization: …
- Added an explicit comment clarifying the GLM-4 MoE streaming inconsistency (also @gemini-code-assist): the streaming/non-streaming divergence …
- SeedOSS end token split across chunks: this is safe because the split check operates on token IDs (…
Thanks for the detailed explanation and for pushing the fix! The clarification on the streaming/non-streaming divergence for …
Summary
Two reasoning parser regressions were introduced by PR #33221 (which consolidated model-specific parsers into the generic DeepSeek V3 delegation chain):
1. GLM4 MoE: tagless text misclassified as reasoning instead of content.
   GLM4 injects `<think>` via the chat template, so when the model output lacks `</think>`, it means the model chose not to reason. The R1-based parser incorrectly treats this as reasoning. Added a dedicated `Glm4MoeReasoningParser` that returns `(None, content)` when `</think>` is absent.
2. SeedOSS: streaming output inconsistent with non-streaming.
   `SeedOSSReasoningParser` extends `BaseThinkingReasoningParser` directly, but the base class's streaming path returns `content` for tagless text, while the non-streaming `extract_reasoning()` returns it as `reasoning`. Added the same R1-style streaming override so both paths agree.

Test plan

- `pytest tests/reasoning/test_glm4_moe_reasoning_parser.py`: all 10 cases pass (`without_think`, `without_think_stream`, `only_open_tag` previously failing)
- `pytest tests/reasoning/test_seedoss_reasoning_parser.py`: streaming cases pass (previously misclassifying tagless text)

Fixes #37023, fixes #37022