[Bugfix] Fix GLM4 MoE and SeedOSS reasoning parser regressions #37044
Open

he-yufeng wants to merge 2 commits into `vllm-project:main` from `he-yufeng:fix/reasoning-parser-regressions`
+124
−2
New file (46 lines added):

```python
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project

from typing import TYPE_CHECKING

from vllm.reasoning.basic_parsers import BaseThinkingReasoningParser

if TYPE_CHECKING:
    from vllm.entrypoints.openai.chat_completion.protocol import ChatCompletionRequest
    from vllm.entrypoints.openai.responses.protocol import ResponsesRequest


class Glm4MoeReasoningParser(BaseThinkingReasoningParser):
    """
    Reasoning parser for GLM-4 MoE models.

    Unlike DeepSeek R1, GLM-4 injects <think> via the chat template rather
    than generating it. When the model output lacks </think>, the entire
    output is treated as *content* (not reasoning), because the absence of
    the end tag means the model chose not to reason.
    """

    @property
    def start_token(self) -> str:
        return "<think>"

    @property
    def end_token(self) -> str:
        return "</think>"

    def extract_reasoning(
        self, model_output: str, request: "ChatCompletionRequest | ResponsesRequest"
    ) -> tuple[str | None, str | None]:
        if self.end_token not in model_output:
            # No closing tag — model didn't produce reasoning.
            # Return the full original output as content.
            return None, model_output

        # Normal case: <think>reasoning</think>content
        parts = model_output.partition(self.start_token)
        after_start = parts[2] if parts[1] else parts[0]
        reasoning, _, content = after_start.partition(self.end_token)

        # Normalize empty strings to None -- <think></think> means
        # the model chose not to reason, not that reasoning is "".
        return reasoning or None, content or None
```
There's an inconsistency between the non-streaming and streaming behavior of this parser for outputs that contain `<think>` but not `</think>`. The `extract_reasoning` method correctly implements the logic described in the docstring: if `</think>` is absent, the entire output is treated as content. For an input like `"<think>some reasoning"`, it will return `(None, "<think>some reasoning")`. However, this class inherits `extract_reasoning_streaming` from `BaseThinkingReasoningParser`. The base implementation will treat `"<think>some reasoning"` as reasoning during streaming, which contradicts this parser's stated logic for handling outputs without a closing `</think>` tag. This is the same type of inconsistency that this PR fixes for `SeedOSSReasoningParser`. To ensure consistent behavior, `Glm4MoeReasoningParser` should also override `extract_reasoning_streaming`. A potential approach is to buffer content after `<think>` and only flush it as reasoning once `</think>` is seen. If the stream ends before `</think>`, the buffer would be flushed as content.
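The buffering approach suggested in the review could be sketched as a simplified standalone generator. This is not vLLM's actual streaming interface (which works on token-ID deltas and per-request state); the function name and chunk-tagging scheme below are illustrative assumptions:

```python
from collections.abc import Iterable, Iterator


def stream_with_buffering(deltas: Iterable[str],
                          start_token: str = "<think>",
                          end_token: str = "</think>") -> Iterator[tuple[str, str]]:
    """Yield ("reasoning", text) or ("content", text) chunks.

    Text is buffered until end_token is seen; only then is the buffered
    prefix flushed as reasoning. If the stream ends first, the whole
    buffer is flushed as content, matching the non-streaming behavior.
    """
    buffer = ""
    end_seen = False
    for delta in deltas:
        if end_seen:
            # Past the closing tag: everything is content, pass through.
            yield ("content", delta)
            continue
        buffer += delta
        if end_token in buffer:
            # Closing tag arrived: the buffered prefix was real reasoning.
            head, _, tail = buffer.partition(end_token)
            reasoning = head.removeprefix(start_token)
            if reasoning:
                yield ("reasoning", reasoning)
            if tail:
                yield ("content", tail)
            end_seen = True
    if not end_seen and buffer:
        # Stream ended without end_token: flush the buffer as content,
        # including any leading start_token, like the batch path does.
        yield ("content", buffer)

print(list(stream_with_buffering(["<think>a", "b</think>", "ans"])))
print(list(stream_with_buffering(["<think>a", "bc"])))
```

The obvious tradeoff is latency: buffering defers all reasoning tokens until `</think>` arrives, so clients see reasoning in one burst instead of incrementally. A production implementation would likely buffer only a small tail (enough to detect a partial closing tag split across deltas) rather than the entire reasoning span.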