
[Bugfix] Fix GLM4 MoE and SeedOSS reasoning parser regressions#37044

Open
he-yufeng wants to merge 2 commits into vllm-project:main from he-yufeng:fix/reasoning-parser-regressions

Conversation

@he-yufeng
Contributor

Summary

Two reasoning parser regressions were introduced by PR #33221 (which consolidated model-specific parsers into the generic DeepSeek V3 delegation chain):

GLM4 MoE — tagless text misclassified as reasoning instead of content.

GLM4 injects <think> via the chat template, so when the model output lacks </think>, it means the model chose not to reason. The R1-based parser incorrectly treats this as reasoning. Added a dedicated Glm4MoeReasoningParser that returns (None, content) when </think> is absent.
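The non-streaming rule described above can be sketched as a minimal standalone helper (hypothetical function, not the actual vLLM class; the normalization of empty strings to `None` matches the behavior added later in this PR):

```python
def glm4_extract_reasoning(model_output: str, end_token: str = "</think>"):
    """Split model output into (reasoning, content).

    GLM-4's chat template injects <think>, so a missing </think> means
    the model opted out of reasoning and the whole output is content.
    """
    if end_token not in model_output:
        return None, model_output
    reasoning, _, content = model_output.partition(end_token)
    # Normalize empty strings to None (<think></think> -> no reasoning).
    return reasoning or None, content or None
```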

SeedOSS — streaming output inconsistent with non-streaming.

SeedOSSReasoningParser extends BaseThinkingReasoningParser directly but the base class's streaming path returns content for tagless text, while the non-streaming extract_reasoning() returns it as reasoning. Added the same R1-style streaming override so both paths agree.
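The R1-style streaming rule can be illustrated with a simplified text-level chunk classifier (a hedged sketch with a hypothetical function name; the real override also checks token IDs):

```python
def classify_seedoss_delta(delta_text: str, previous_text: str,
                           end_token: str = "</seed:think>"):
    """Classify one streaming chunk as (reasoning, content).

    The start token comes from the chat template, so tagless text is
    reasoning until the end token has been seen -- mirroring the
    non-streaming extract_reasoning() behavior.
    """
    if end_token in previous_text:
        return None, delta_text          # reasoning already closed
    if end_token in delta_text:
        idx = delta_text.find(end_token)
        reasoning = delta_text[:idx] or None
        content = delta_text[idx + len(end_token):] or None
        return reasoning, content        # end token splits this chunk
    return delta_text, None              # still inside reasoning
```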

Test plan

  • pytest tests/reasoning/test_glm4_moe_reasoning_parser.py — all 10 cases pass (without_think, without_think_stream, only_open_tag previously failing)
  • pytest tests/reasoning/test_seedoss_reasoning_parser.py — streaming cases pass (previously misclassifying tagless text)

Fixes #37023, fixes #37022

@mergify mergify bot added the bug Something isn't working label Mar 14, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces two fixes for reasoning parser regressions in GLM4 MoE and SeedOSS models. The changes for SeedOSS correctly align streaming and non-streaming behavior. For GLM4 MoE, a new parser is introduced that handles non-streaming output without a closing </think> tag. However, I've identified an inconsistency between the streaming and non-streaming behavior in the new GLM4 parser. Specifically, output with an opening <think> tag but no closing tag is treated as reasoning in streaming mode but as content in non-streaming mode. This should be addressed to ensure consistent parsing logic across both modes.

Comment on lines +13 to +21
class Glm4MoeReasoningParser(BaseThinkingReasoningParser):
"""
Reasoning parser for GLM-4 MoE models.

Unlike DeepSeek R1, GLM-4 injects <think> via the chat template rather
than generating it. When the model output lacks </think>, the entire
output is treated as *content* (not reasoning), because the absence of
the end tag means the model chose not to reason.
"""


high

There's an inconsistency between the non-streaming and streaming behavior of this parser for outputs that contain <think> but not </think>.

  • The extract_reasoning method correctly implements the logic described in the docstring: if </think> is absent, the entire output is treated as content. For an input like "<think>some reasoning", it will return (None, "<think>some reasoning").

  • However, this class inherits extract_reasoning_streaming from BaseThinkingReasoningParser. The base implementation will treat "<think>some reasoning" as reasoning during streaming, which contradicts this parser's stated logic for handling outputs without a closing </think> tag.

This is the same type of inconsistency that this PR fixes for SeedOSSReasoningParser. To ensure consistent behavior, Glm4MoeReasoningParser should also override extract_reasoning_streaming. A potential approach is to buffer content after <think> and only flush it as reasoning once </think> is seen. If the stream ends before </think>, the buffer would be flushed as content.
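The suggested buffering approach could look roughly like this (a hypothetical standalone class, not vLLM code; the trade-off is that buffered tokens are not delivered to the client until the end tag arrives or the stream finishes):

```python
class BufferedThinkStreamer:
    """Hold streamed text until </think> arrives, then flush it as
    reasoning; if the stream ends first, flush the buffer as content."""

    def __init__(self, end_token: str = "</think>"):
        self.end_token = end_token
        self.buffer = ""
        self.closed = False

    def feed(self, delta: str):
        """Return (reasoning, content) ready to emit for this chunk."""
        if self.closed:
            return None, delta
        self.buffer += delta
        if self.end_token in self.buffer:
            reasoning, _, rest = self.buffer.partition(self.end_token)
            self.buffer, self.closed = "", True
            return reasoning or None, rest or None
        return None, None  # keep buffering until we know which side it is

    def finish(self):
        """Call at end of stream; an unterminated buffer becomes content."""
        leftover, self.buffer = self.buffer, ""
        return None, leftover or None
```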

Contributor

@alvinttang alvinttang left a comment


Review

This PR fixes two distinct regressions — the GLM-4 MoE reasoning parser and the SeedOSS streaming parser. Both fixes are well-scoped.

GLM-4 MoE reasoning parser

The old mapping pointed "glm45" at DeepSeekV3ReasoningWithThinkingParser, which is incorrect for GLM-4 MoE. The new Glm4MoeReasoningParser correctly inherits from BaseThinkingReasoningParser and overrides the key semantic difference: when </think> is absent, the entire output is treated as content rather than reasoning. This matches GLM-4's behavior where <think> is injected via the chat template, and the model can opt out of reasoning by not emitting the end tag.

The extract_reasoning implementation is clean. One edge case to consider: if the model outputs <think></think> (empty reasoning), reasoning will be "" and content will be None (since content or None converts empty string to None). Is empty-string reasoning semantically meaningful here, or should it also be normalized to None?

SeedOSS streaming parser

The extract_reasoning_streaming override handles the case where the start token is in the chat template (not generated). The logic:

  1. If neither previous_token_ids nor delta_token_ids contain the start token ID...
  2. Check if end token is in delta → split into reasoning + content
  3. Check if end token is in previous → all delta is content
  4. Otherwise → all delta is reasoning

This is correct, but I have a concern about the token ID checks: self.start_token_id not in previous_token_ids does a linear scan of the full token history on every streaming chunk. For long generations, this could become expensive. Consider tracking whether the start token was seen via a boolean flag on the parser instance rather than re-scanning the history each time.
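The O(1) alternative raised here could be as simple as the following sketch (hypothetical class, not part of vLLM):

```python
class StartTokenFlag:
    """Remember once whether the start token has appeared, instead of
    rescanning the full token history on every streaming chunk."""

    def __init__(self, start_token_id: int):
        self.start_token_id = start_token_id
        self.seen = False

    def observe(self, delta_token_ids) -> bool:
        """Update the flag from the newest chunk only; return current state."""
        if not self.seen and self.start_token_id in delta_token_ids:
            self.seen = True
        return self.seen
```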

Also, in the branch that handles the end-token-in-delta case, end_index = delta_text.find(self.end_token): if the end token is split across two delta chunks (rare but possible with some tokenizers), find will fail and the text won't be split correctly. The base class may already handle this via the partial-match machinery, so this might be fine — but it's worth verifying.

Missing tests

Neither parser has unit tests in this PR. Given that these are regression fixes, adding at least a couple of test cases for each (especially the "no end tag" and "streaming split" edge cases) would help prevent future regressions.

Overall, solid fixes. The GLM-4 parser is particularly clean.

@he-yufeng
Contributor Author

Thanks for the thorough review @alvinttang!

Empty reasoning: Good catch. "" reasoning should indeed be None — an empty think block means the model chose not to reason. I'll normalize it.

Linear scan concern: Valid point. However, this is inherited from DeepSeekR1ReasoningParser (same pattern at line 47-48 of deepseek_r1_reasoning_parser.py), so changing it here would diverge from the existing R1 behavior. I'd suggest addressing the linear scan optimization as a follow-up across all parsers that share this pattern.

Tests: The test cases already exist in tests/reasoning/test_glm4_moe_reasoning_parser.py (the file that was failing before this fix). The SeedOSS tests are in tests/reasoning/test_seedoss_reasoning_parser.py. Both were added as part of the CI regression PR #37025.

GLM4 MoE (vllm-project#37023):
PR vllm-project#33221 replaced the dedicated Glm4MoeModelReasoningParser with the
generic DeepSeekV3ReasoningWithThinkingParser, which delegates to R1.
R1 treats tagless text as reasoning, but GLM4 injects <think> via the
chat template, so tagless output means the model chose not to reason.
Add Glm4MoeReasoningParser that returns (None, content) when </think>
is absent, matching the expected semantics. Empty reasoning (<think></think>)
is normalized to None.

SeedOSS (vllm-project#37022):
The streaming path in BaseThinkingReasoningParser returns content for
tagless text, but SeedOSS (like R1) may not emit the start token. Add
the same R1-style streaming override so tagless streaming text is
correctly classified as reasoning.

Fixes vllm-project#37023, fixes vllm-project#37022

Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com>
@he-yufeng he-yufeng force-pushed the fix/reasoning-parser-regressions branch from 10ae5cf to 7ae12d6 on March 15, 2026 at 05:49
<think></think> means the model chose not to reason, so empty string
reasoning should be None, not "". Added test cases for this edge case.
@he-yufeng
Contributor Author

Thanks again for the detailed review @alvinttang! Pushed a fix (158912e):

Empty reasoning normalization — Added an explicit comment clarifying the or None normalization on the return, and added test cases for <think></think> (empty think block with and without trailing content). The or None was already handling this correctly, but now the intent is documented and tested.

GLM-4 MoE streaming inconsistency (also @gemini-code-assist) — The streaming/non-streaming divergence for <think>... (no closing tag) is intentional and unavoidable. During streaming, we can't know if </think> will arrive in a future chunk, so the base class must optimistically route text as reasoning until the end token appears. The non-streaming path has the complete output and can make the definitive call that no </think> means the model opted out of reasoning. This is the same tradeoff every think-tag parser makes — it's inherent to incremental parsing.

SeedOSS end token split across chunks — This is safe because the split check operates on token IDs (self.end_token_id in delta_token_ids), not raw text. Since </seed:think> is a single special token in the vocabulary, it will always appear as one token ID in delta_token_ids — it can't be split across chunks at the token level. The delta_text.find(self.end_token) is only reached after confirming the token ID is present in the delta.

@gemini-code-assist
Contributor

Thanks for the detailed explanation and for pushing the fix! The clarification on the streaming/non-streaming divergence for <think> tags and the confirmation that the SeedOSS end token check is safe at the token ID level are very helpful. The added documentation and test cases for empty think blocks look great. I have no further concerns.


Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[CI Failure]: GLM4 moe reasoning parser test failure
[CI Failure]: Seedoss reasoning parser test failure

2 participants