[Bugfix] Fix Gemma4 reasoning for batch chat completions by Kimahriman · Pull Request #42105 · vllm-project/vllm

Kimahriman · 2026-05-08T18:22:30Z

Summary

Batch chat completions were not running reasoning parser request adjustment during preprocessing. This meant Gemma4ReasoningParser.adjust_request() was skipped for /v1/chat/completions/batch, leaving skip_special_tokens=True and allowing Gemma 4 reasoning delimiters such as <|channel> and <channel|> to be dropped before final reasoning parsing.
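To make the failure mode concrete, here is a minimal sketch of how dropping special tokens during decoding hides the delimiters from a later parsing step. It is not vLLM's tokenizer or the real Gemma4ReasoningParser: `decode`, `split_reasoning`, and the token list are hypothetical stand-ins, and the exact Gemma 4 output layout is assumed for illustration only.

```python
from __future__ import annotations

# Toy illustration only: hypothetical stand-ins, not vLLM's or Gemma 4's real code.
SPECIAL_TOKENS = {"<|channel>", "<channel|>"}


def decode(tokens: list[str], skip_special_tokens: bool) -> str:
    """Join tokens, optionally dropping special tokens the way a detokenizer might."""
    kept = [t for t in tokens if not (skip_special_tokens and t in SPECIAL_TOKENS)]
    return "".join(kept)


def split_reasoning(text: str) -> tuple[str | None, str]:
    """Split text into (reasoning, content) when both delimiters are present."""
    start, end = "<|channel>", "<channel|>"
    if start in text and end in text:
        before, _, rest = text.partition(start)
        reasoning, _, content = rest.partition(end)
        return reasoning, before + content
    # Without delimiters there is nothing to separate, so everything is content.
    return None, text


# Assumed output shape, for illustration only.
generated = ["<|channel>", "thought", " 2+2=4", "<channel|>", "The answer is 4."]

skipped = split_reasoning(decode(generated, skip_special_tokens=True))
kept = split_reasoning(decode(generated, skip_special_tokens=False))
```

In this toy, `skipped` has no reasoning at all and the reasoning text ends up mixed into the content, while `kept` separates the two, which is the behavior the request adjustment is meant to preserve.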

This PR makes the batch path mirror regular chat serving more closely:

Build each per-conversation ChatCompletionRequest inside batch rendering.
Pass reasoning_parser=self.reasoning_parser_cls into preprocess_chat(...).
Return and reuse the adjusted per-conversation requests for sampling params, engine reasoning state, and final parsing.
Pass reasoning_ended and reasoning_parser_kwargs into engine_client.generate(...) consistently with non-batch chat serving.
Add a regression test covering the adjusted request flow.
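The flow in the list above can be sketched roughly as follows. This is a simplified model under stated assumptions, not the code in this PR: `ToyRequest`, `ToyReasoningParser`, `toy_preprocess`, and `toy_render_batch` are hypothetical names standing in for ChatCompletionRequest, the reasoning parser, preprocess_chat, and batch rendering.

```python
from __future__ import annotations

from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ToyRequest:
    """Hypothetical stand-in for a per-conversation chat request."""

    messages: tuple[str, ...]
    skip_special_tokens: bool = True


class ToyReasoningParser:
    """Hypothetical parser whose adjust_request keeps special tokens in the output."""

    def adjust_request(self, request: ToyRequest) -> ToyRequest:
        return replace(request, skip_special_tokens=False)


def toy_preprocess(request, reasoning_parser=None):
    """Apply the parser's adjustment only when a parser is supplied."""
    if reasoning_parser is not None:
        request = reasoning_parser.adjust_request(request)
    return request


def toy_render_batch(conversations, reasoning_parser=None):
    """Build one request per conversation and return the adjusted requests."""
    return [
        toy_preprocess(ToyRequest(messages=tuple(conv)), reasoning_parser)
        for conv in conversations
    ]


conversations = [["hi"], ["what is 2+2?"]]
without_parser = toy_render_batch(conversations)  # models the buggy path
with_parser = toy_render_batch(conversations, ToyReasoningParser())  # models the fix
```

The point of returning the adjusted requests is that later stages (sampling params, final parsing) read from the same objects the parser modified, rather than from a fresh, unadjusted copy.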

Duplicate-work check

I checked for existing work before opening this PR:

gh issue view 42103 --repo vllm-project/vllm --comments
gh pr list --repo vllm-project/vllm --state open --search "42103 in:body"
gh pr list --repo vllm-project/vllm --state open --search "Gemma4 batch chat reasoning parser"
gh pr list --repo vllm-project/vllm --state open --search "batch chat completions reasoning parser"

I found related Gemma 4 parser work in #39027, but no open PR addressing this batch chat completions bug.

Testing

pre-commit hooks during commit: passed, including ruff check, ruff format, typos, mypy-local, SPDX checks, and signoff.
.venv/bin/pre-commit run ruff-format --files vllm/entrypoints/openai/chat_completion/batch_serving.py tests/entrypoints/openai/chat_completion/test_batched_chat_completions.py: passed.
.venv/bin/pre-commit run ruff-check --files vllm/entrypoints/openai/chat_completion/batch_serving.py tests/entrypoints/openai/chat_completion/test_batched_chat_completions.py: passed.
.venv/bin/python -m py_compile vllm/entrypoints/openai/chat_completion/batch_serving.py tests/entrypoints/openai/chat_completion/test_batched_chat_completions.py: passed.

New test passes locally.

AI assistance

AI assistance was used to inspect the serving paths, implement the fix, add the regression test, and draft this PR description. I have reviewed and verified the changes.

Ensure batch chat generation uses the adjusted per-conversation ChatCompletionRequest objects returned by preprocessing, and pass reasoning state into engine generation consistently with non-batch chat serving.

Co-authored-by: OpenAI Codex <codex@openai.com>
Signed-off-by: Adam Binford <adamq43@gmail.com>

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

Kimahriman · 2026-05-08T18:25:24Z



+@pytest.mark.asyncio
+async def test_batch_render_uses_adjusted_reasoning_requests() -> None:


The other tests in this module appear to be purely integration tests against a real server. I'm not sure whether that matters here.

Kimahriman · 2026-05-08T18:26:46Z

+            if (
+                not single_request.include_reasoning
+                or single_request._grammar_from_tool_parser
+            ):
+                reasoning_ended = True
+            elif reasoning_parser:
+                reasoning_ended = reasoning_parser.is_reasoning_end(
+                    prompt_token_ids or []
+                )
+            else:
+                reasoning_ended = None
+            chat_template_kwargs = self._effective_chat_template_kwargs(single_request)


These changes aren't directly related to the bug described, but they address part of the inconsistency with the regular chat completions logic. I can pull them out if desired. It may also be worth abstracting more of the common logic into the non-batch serving module to help prevent future regressions.
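For readers skimming the diff, the three-way decision in the snippet above can be restated as a standalone function. This is a sketch only: `decide_reasoning_ended`, `END_TOKEN_ID`, and `toy_is_reasoning_end` are hypothetical, and the callable parameter stands in for the parser's is_reasoning_end method.

```python
from typing import Callable, Optional, Sequence

END_TOKEN_ID = 99  # hypothetical id marking the end of reasoning


def toy_is_reasoning_end(token_ids: Sequence[int]) -> bool:
    """Hypothetical check: reasoning has ended if the end marker appears in the prompt."""
    return END_TOKEN_ID in token_ids


def decide_reasoning_ended(
    include_reasoning: bool,
    grammar_from_tool_parser: bool,
    is_reasoning_end: Optional[Callable[[Sequence[int]], bool]],
    prompt_token_ids: Optional[Sequence[int]],
) -> Optional[bool]:
    """Mirror the branch order of the diff: forced True, parser decides, or unknown."""
    if not include_reasoning or grammar_from_tool_parser:
        # Reasoning is disabled, or a tool-parser grammar applies: treat as ended.
        return True
    if is_reasoning_end is not None:
        # Ask the parser, treating a missing prompt as an empty token list.
        return is_reasoning_end(list(prompt_token_ids or []))
    # No parser configured: the state is left unset.
    return None
```

The `None` result is kept distinct from `False` here because the diff itself distinguishes "no parser configured" from "parser says reasoning has not ended".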

gemini-code-assist

Code Review

This pull request enhances the batched chat completion functionality by ensuring that reasoning parsers and request-specific configurations are correctly handled for each individual request within a batch. Key changes include updating render_batch_chat_request to return individual request objects, implementing per-request reasoning end detection, and ensuring that roles and reasoning extraction are correctly applied to each completion choice. A new test case was also added to verify these improvements. I have no feedback to provide.

Kimahriman · 2026-05-08T18:29:07Z

cc @bbrowning since you made the original adjust_request fix

alexbi29 · 2026-05-17T11:23:02Z

Good catch, but there are two more paths with the same bug:

1. Regular /v1/chat/completions streaming (chat_completion/serving.py)
adjust_request is never called here either. The reasoning parser gets instantiated around line 252 but skip_special_tokens stays True. Quick fix:

if self.reasoning_parser_cls:
    reasoning_parser = self.reasoning_parser_cls(tokenizer, ...)
    request = reasoning_parser.adjust_request(request)  # missing
result = await self.render_chat_request(request)

2. /v1/responses (responses/serving.py)
Same deal — reasoning_parser_cls is used in a few places but adjust_request is never called.

Both paths go through preprocess_chat in serve/render/serving.py, which already handles adjust_request correctly when reasoning_parser is passed (line 580). Same pattern you used for batch would close all three gaps cleanly.

(Found these while debugging Gemma4 <|channel>thought tokens leaking into responses — the streaming path in particular causes real-world pain.)

Kimahriman · 2026-05-17T15:04:01Z

The openai_serving_render already has the reasoning_parser built into it (though awkwardly, by assigning reasoning_parser to structured_outputs_config and then pulling it back off there). So chat completions should be working fine, and streaming and non-streaming request preprocessing use the same path. I haven't had any issues with streaming chat completions. chat_template_kwargs support in responses was only just added, so I haven't had a chance to try it out, but it calls preprocess_chat directly with the configured reasoning_parser, so that should be working fine too. If you actually see <|channel>thought in the output, that's unrelated to this bug, since it would mean special tokens were indeed not skipped, which is exactly what adjust_request is supposed to arrange.

alexbi29 · 2026-05-17T21:21:41Z

Yeah you're right, sorry.

What happened on my end: I saw the reasoning parser get instantiated at chat_completion/serving.py:252 without an .adjust_request() next to it and pattern-matched it to the batch bug without tracing the call chain further. The actual adjust_request call is two hops earlier: create_chat_completion → render_chat_request → OpenAIServingRender.render_chat → preprocess_chat(..., reasoning_parser=self.reasoning_parser), and preprocess_chat does the .adjust_request() call. Same story for /v1/responses: lines 617 and 642 already pass reasoning_parser= to preprocess_chat. The instantiation I was looking at is just a second instance used later for is_reasoning_end / extract_reasoning post-processing.

The <|channel>thought leaks I was chasing turned out to be a separate streaming-side bug, already addressed in #42875 — totally unrelated to adjust_request plumbing. Verified on a fresh vLLM bounce with no local patches: reasoning, tool calls, hostile prompts asking for literal <|channel> — all routed correctly into reasoning / structured tool_calls, nothing leaking. Your batch fix stands as-is, ignore the noise.

mergify · 2026-05-24T11:35:11Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Kimahriman.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify Bot added the frontend and bug (Something isn't working) labels May 8, 2026

Kimahriman marked this pull request as ready for review May 8, 2026 18:24

Kimahriman requested review from DarkLight1337, NickLucche, aarnphm, chaunceyjiang, robertgshaw2-redhat and russellb as code owners May 8, 2026 18:24

claude Bot reviewed May 8, 2026

Kimahriman commented May 8, 2026

gemini-code-assist Bot reviewed May 8, 2026

DarkLight1337 added the verified label (Run pre-commit for new contributors without triggering other tests) May 9, 2026

mergify Bot added the needs-rebase label May 24, 2026

Merge branch 'main' into batch-chat-reasoning-adjust

5dec5ea

mergify Bot removed the needs-rebase label May 25, 2026

Kimahriman wants to merge 2 commits into vllm-project:main from Kimahriman:batch-chat-reasoning-adjust


3 participants


