Skip to content

[Bugfix] Fix Gemma4 reasoning for batch chat completions#42105

Open
Kimahriman wants to merge 2 commits into
vllm-project:mainfrom
Kimahriman:batch-chat-reasoning-adjust
Open

[Bugfix] Fix Gemma4 reasoning for batch chat completions#42105
Kimahriman wants to merge 2 commits into
vllm-project:mainfrom
Kimahriman:batch-chat-reasoning-adjust

Conversation

@Kimahriman
Copy link
Copy Markdown
Contributor

@Kimahriman Kimahriman commented May 8, 2026

Fixes #42103.

Summary

Batch chat completions were not running reasoning parser request adjustment during preprocessing. This meant Gemma4ReasoningParser.adjust_request() was skipped for /v1/chat/completions/batch, leaving skip_special_tokens=True and allowing Gemma 4 reasoning delimiters such as <|channel> and <channel|> to be dropped before final reasoning parsing.

This PR makes the batch path mirror regular chat serving more closely:

  • Build each per-conversation ChatCompletionRequest inside batch rendering.
  • Pass reasoning_parser=self.reasoning_parser_cls into preprocess_chat(...).
  • Return and reuse the adjusted per-conversation requests for sampling params, engine reasoning state, and final parsing.
  • Pass reasoning_ended and reasoning_parser_kwargs into engine_client.generate(...) consistently with non-batch chat serving.
  • Add a regression test covering the adjusted request flow.

Duplicate-work check

I checked for existing work before opening this PR:

  • gh issue view 42103 --repo vllm-project/vllm --comments
  • gh pr list --repo vllm-project/vllm --state open --search "42103 in:body"
  • gh pr list --repo vllm-project/vllm --state open --search "Gemma4 batch chat reasoning parser"
  • gh pr list --repo vllm-project/vllm --state open --search "batch chat completions reasoning parser"

I found related Gemma 4 parser work in #39027, but no open PR addressing this batch chat completions bug.

Testing

  • pre-commit hooks during commit: passed, including ruff check, ruff format, typos, mypy-local, SPDX checks, and signoff.
  • .venv/bin/pre-commit run ruff-format --files vllm/entrypoints/openai/chat_completion/batch_serving.py tests/entrypoints/openai/chat_completion/test_batched_chat_completions.py: passed.
  • .venv/bin/pre-commit run ruff-check --files vllm/entrypoints/openai/chat_completion/batch_serving.py tests/entrypoints/openai/chat_completion/test_batched_chat_completions.py: passed.
  • .venv/bin/python -m py_compile vllm/entrypoints/openai/chat_completion/batch_serving.py tests/entrypoints/openai/chat_completion/test_batched_chat_completions.py: passed.

New test passes locally.

AI assistance

AI assistance was used to inspect the serving paths, implement the fix, add the regression test, and draft this PR description. I have reviewed and verified the changes.

Ensure batch chat generation uses the adjusted per-conversation ChatCompletionRequest objects returned by preprocessing, and pass reasoning state into engine generation consistently with non-batch chat serving.

Co-authored-by: OpenAI Codex <codex@openai.com>

Signed-off-by: Adam Binford <adamq43@gmail.com>
@mergify mergify Bot added frontend bug Something isn't working labels May 8, 2026
@Kimahriman Kimahriman marked this pull request as ready for review May 8, 2026 18:24
Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.



@pytest.mark.asyncio
async def test_batch_render_uses_adjusted_reasoning_requests() -> None:
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other tests in this module seem like purely integration tests with a real server, not sure if that matters

Comment on lines +185 to +196
if (
not single_request.include_reasoning
or single_request._grammar_from_tool_parser
):
reasoning_ended = True
elif reasoning_parser:
reasoning_ended = reasoning_parser.is_reasoning_end(
prompt_token_ids or []
)
else:
reasoning_ended = None
chat_template_kwargs = self._effective_chat_template_kwargs(single_request)
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes aren't directly related to the bug described, but are part of what is inconsistent with the regular chat completions logic. I can pull these out if desired. Also, it may be worth trying to abstract more of the common logic into the non-batch serving module to help prevent future regressions

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request enhances the batched chat completion functionality by ensuring that reasoning parsers and request-specific configurations are correctly handled for each individual request within a batch. Key changes include updating render_batch_chat_request to return individual request objects, implementing per-request reasoning end detection, and ensuring that roles and reasoning extraction are correctly applied to each completion choice. A new test case was also added to verify these improvements. I have no feedback to provide.

@Kimahriman
Copy link
Copy Markdown
Contributor Author

cc @bbrowning since you made the original adjust_request fix

@DarkLight1337 DarkLight1337 added the verified Run pre-commit for new contributors without triggering other tests label May 9, 2026
@alexbi29
Copy link
Copy Markdown

Good catch, but there are two more paths with the same bug:

1. Regular /v1/chat/completions streaming (chat_completion/serving.py)
adjust_request is never called here either. The reasoning parser gets instantiated around line 252 but skip_special_tokens stays True. Quick fix:

if self.reasoning_parser_cls:
    reasoning_parser = self.reasoning_parser_cls(tokenizer, ...)
    request = reasoning_parser.adjust_request(request)  # missing
result = await self.render_chat_request(request)

2. /v1/responses (responses/serving.py)
Same deal — reasoning_parser_cls is used in a few places but adjust_request is never called.

Both paths go through preprocess_chat in serve/render/serving.py, which already handles adjust_request correctly when reasoning_parser is passed (line 580). Same pattern you used for batch would close all three gaps cleanly.

(Found these while debugging Gemma4 <|channel>thought tokens leaking into responses — the streaming path in particular causes real-world pain.)

@Kimahriman
Copy link
Copy Markdown
Contributor Author

Kimahriman commented May 17, 2026

1. Regular /v1/chat/completions streaming (chat_completion/serving.py) adjust_request is never called here either. The reasoning parser gets instantiated around line 252 but skip_special_tokens stays True. Quick fix:

if self.reasoning_parser_cls:
    reasoning_parser = self.reasoning_parser_cls(tokenizer, ...)
    request = reasoning_parser.adjust_request(request)  # missing
result = await self.render_chat_request(request)

2. /v1/responses (responses/serving.py) Same deal — reasoning_parser_cls is used in a few places but adjust_request is never called.

Both paths go through preprocess_chat in serve/render/serving.py, which already handles adjust_request correctly when reasoning_parser is passed (line 580). Same pattern you used for batch would close all three gaps cleanly.

(Found these while debugging Gemma4 <|channel>thought tokens leaking into responses — the streaming path in particular causes real-world pain.)

The openai_serving_render already has the reasoning_parser builtin to it (though awkwardly by assigning reasoning_parser to structured_outputs_config and then pulling it off there.) So chat completions should be working fine, and streaming and non-streaming request preprocessing use the same path. I haven't had any issues with this with streaming chat completions. chat_template_kwargs support in responses was only just added so I haven't had a chance to try that out, but it calls preprocess_chat directly with the configured reasoning_parser, so that should be working fine too. If you actually see <|channel>thought in the output that's unrelated to this bug, since that would mean special tokens were indeed not skipped like adjust_request is supposed to do.

@alexbi29
Copy link
Copy Markdown

Yeah you're right, sorry.

What happened on my end: I saw the reasoning parser get instantiated at chat_completion/serving.py:252 without an .adjust_request() next to it and pattern-matched it to the batch bug without tracing the call chain further. The actual adjust_request is two hops earlier — create_chat_completionrender_chat_requestOpenAIServingRender.render_chatpreprocess_chat(..., reasoning_parser=self.reasoning_parser), and preprocess_chat does the .adjust_request() call. Same story for /v1/responses: 617 and 642 already pass reasoning_parser= to preprocess_chat. The instantiation I was looking at is just a second instance used later for is_reasoning_end / extract_reasoning post-processing.

The <|channel>thought leaks I was chasing turned out to be a separate streaming-side bug, already addressed in #42875 — totally unrelated to adjust_request plumbing. Verified on a fresh vLLM bounce with no local patches: reasoning, tool calls, hostile prompts asking for literal <|channel> — all routed correctly into reasoning / structured tool_calls, nothing leaking. Your batch fix stands as-is, ignore the noise.

@mergify
Copy link
Copy Markdown
Contributor

mergify Bot commented May 24, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Kimahriman.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label May 24, 2026
@mergify mergify Bot removed the needs-rebase label May 25, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working frontend verified Run pre-commit for new contributors without triggering other tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Batch chat completions drop Gemma 4 reasoning delimiters

3 participants