[Bugfix] Fix Gemma4 reasoning for batch chat completions #42105
Kimahriman wants to merge 2 commits into
Ensure batch chat generation uses the adjusted per-conversation ChatCompletionRequest objects returned by preprocessing, and pass reasoning state into engine generation consistently with non-batch chat serving. Co-authored-by: OpenAI Codex <codex@openai.com> Signed-off-by: Adam Binford <adamq43@gmail.com>
@pytest.mark.asyncio
async def test_batch_render_uses_adjusted_reasoning_requests() -> None:
The other tests in this module appear to be pure integration tests that run against a real server. I'm not sure whether that matters for this one.
if (
    not single_request.include_reasoning
    or single_request._grammar_from_tool_parser
):
    reasoning_ended = True
elif reasoning_parser:
    reasoning_ended = reasoning_parser.is_reasoning_end(
        prompt_token_ids or []
    )
else:
    reasoning_ended = None
chat_template_kwargs = self._effective_chat_template_kwargs(single_request)
These changes aren't directly related to the bug described, but they address part of the inconsistency with the regular chat completions logic. I can pull them out if desired. It may also be worth abstracting more of the common logic into the non-batch serving module to help prevent future regressions.
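One way the shared logic mentioned above might be factored out is a small helper that both serving paths call. The following is only a sketch: `resolve_reasoning_ended`, `RequestLike`, and `EndTokenParser` are hypothetical names that do not exist in vLLM, and the request and parser types are stand-ins that model only the attributes used in the diff above.

```python
from dataclasses import dataclass
from typing import Optional, Protocol, Sequence


class ReasoningParserLike(Protocol):
    """Stand-in for the one parser method the diff above relies on."""

    def is_reasoning_end(self, input_ids: Sequence[int]) -> bool: ...


@dataclass
class RequestLike:
    """Hypothetical stand-in for the two request fields read by the diff."""

    include_reasoning: bool = True
    grammar_from_tool_parser: bool = False


def resolve_reasoning_ended(
    request: RequestLike,
    reasoning_parser: Optional[ReasoningParserLike],
    prompt_token_ids: Optional[Sequence[int]],
) -> Optional[bool]:
    """Return True/False when reasoning state is known, None when no parser is set."""
    # Reasoning is treated as already finished when it is not requested
    # or when a tool parser supplied the grammar.
    if not request.include_reasoning or request.grammar_from_tool_parser:
        return True
    if reasoning_parser is not None:
        return reasoning_parser.is_reasoning_end(prompt_token_ids or [])
    return None


class EndTokenParser:
    """Toy parser for demonstration: reasoning ends once a given token id appears."""

    def __init__(self, end_token_id: int) -> None:
        self.end_token_id = end_token_id

    def is_reasoning_end(self, input_ids: Sequence[int]) -> bool:
        return self.end_token_id in input_ids


if __name__ == "__main__":
    parser = EndTokenParser(end_token_id=7)
    print(resolve_reasoning_ended(RequestLike(), parser, [1, 2, 7]))
    print(resolve_reasoning_ended(RequestLike(include_reasoning=False), None, None))
```

Keeping the decision in one function would let the batch and non-batch paths share a single code path for this state, which is the kind of regression guard the comment above suggests.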
Code Review
This pull request enhances the batched chat completion functionality by ensuring that reasoning parsers and request-specific configurations are correctly handled for each individual request within a batch. Key changes include updating render_batch_chat_request to return individual request objects, implementing per-request reasoning end detection, and ensuring that roles and reasoning extraction are correctly applied to each completion choice. A new test case was also added to verify these improvements. I have no feedback to provide.
cc @bbrowning since you made the original adjust_request fix
Good catch, but there are two more paths with the same bug:
1. Regular
    if self.reasoning_parser_cls:
        reasoning_parser = self.reasoning_parser_cls(tokenizer, ...)
        request = reasoning_parser.adjust_request(request)  # missing
    result = await self.render_chat_request(request)
2. Both paths go through (Found these while debugging Gemma4
Yeah you're right, sorry. What happened on my end: I saw the reasoning parser get instantiated at The
This pull request has merge conflicts that must be resolved before it can be |
Fixes #42103.
Summary
Batch chat completions were not running reasoning parser request adjustment during preprocessing. This meant Gemma4ReasoningParser.adjust_request() was skipped for /v1/chat/completions/batch, leaving skip_special_tokens=True and allowing Gemma 4 reasoning delimiters such as <|channel> and <channel|> to be dropped before final reasoning parsing.
This PR makes the batch path mirror regular chat serving more closely:
- Use the adjusted ChatCompletionRequest inside batch rendering.
- Pass reasoning_parser=self.reasoning_parser_cls into preprocess_chat(...).
- Pass reasoning_ended and reasoning_parser_kwargs into engine_client.generate(...) consistently with non-batch chat serving.
Duplicate-work check
I checked for existing work before opening this PR:
- gh issue view 42103 --repo vllm-project/vllm --comments
- gh pr list --repo vllm-project/vllm --state open --search "42103 in:body"
- gh pr list --repo vllm-project/vllm --state open --search "Gemma4 batch chat reasoning parser"
- gh pr list --repo vllm-project/vllm --state open --search "batch chat completions reasoning parser"
I found related Gemma 4 parser work in #39027, but no open PR addressing this batch chat completions bug.
Testing
- pre-commit hooks during commit: passed, including ruff check, ruff format, typos, mypy-local, SPDX checks, and signoff.
- .venv/bin/pre-commit run ruff-format --files vllm/entrypoints/openai/chat_completion/batch_serving.py tests/entrypoints/openai/chat_completion/test_batched_chat_completions.py: passed.
- .venv/bin/pre-commit run ruff-check --files vllm/entrypoints/openai/chat_completion/batch_serving.py tests/entrypoints/openai/chat_completion/test_batched_chat_completions.py: passed.
- .venv/bin/python -m py_compile vllm/entrypoints/openai/chat_completion/batch_serving.py tests/entrypoints/openai/chat_completion/test_batched_chat_completions.py: passed.
New test passes locally.
AI assistance
AI assistance was used to inspect the serving paths, implement the fix, add the regression test, and draft this PR description. I have reviewed and verified the changes.
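For readers unfamiliar with the failure mode described in the Summary, the toy model below shows why dropping special tokens at decode time breaks delimiter-based reasoning extraction. It is an illustration only: `decode` and `split_reasoning` are hypothetical helpers, not vLLM or Gemma4ReasoningParser code, and the token list is invented. Only the two delimiter strings come from the PR description.

```python
from typing import List, Optional, Tuple

# Delimiters named in the PR description, treated here as special tokens.
START, END = "<|channel>", "<channel|>"
SPECIAL_TOKENS = {START, END}


def decode(tokens: List[str], skip_special_tokens: bool) -> str:
    """Toy detokenizer: optionally drops special tokens, then joins the rest."""
    kept = [t for t in tokens if not (skip_special_tokens and t in SPECIAL_TOKENS)]
    return "".join(kept)


def split_reasoning(text: str) -> Tuple[Optional[str], str]:
    """Toy parser: returns (reasoning, content); reasoning is None without delimiters."""
    if START in text and END in text:
        before, _, rest = text.partition(START)
        reasoning, _, content = rest.partition(END)
        return reasoning, before + content
    return None, text


if __name__ == "__main__":
    tokens = [START, "think", END, "answer"]
    # Delimiters stripped: the parser cannot separate reasoning from content.
    print(split_reasoning(decode(tokens, skip_special_tokens=True)))
    # Delimiters kept: reasoning and content are split as intended.
    print(split_reasoning(decode(tokens, skip_special_tokens=False)))
```

In this model, the request adjustment corresponds to switching `skip_special_tokens` to `False` before generation, so that the delimiters survive until the final parsing step.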