
[rollout] fix: use configured response_length as default max_tokens in vLLM async server#4703

Merged
wuxibin89 merged 1 commit into verl-project:main from yurekami:fix/vllm-async-max-tokens-default
Dec 29, 2025

Conversation

@yurekami
Contributor

@yurekami yurekami commented Dec 28, 2025

Summary

Fixes #4568

This PR fixes the max_tokens calculation bug in the vLLM async server where short prompts could cause over-generation.

Problem

When no max_tokens is provided in sampling_params, the default was calculated as:

```python
response_length = self.config.max_model_len - len(prompt_ids)
```

Since max_model_len = prompt_length + response_length, for short prompts this resulted in max_tokens being much larger than the intended response_length, causing:

  • Unnecessary over-generation
  • Latency spikes
  • Wasted compute resources
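To put rough numbers on the over-generation: the config values below are hypothetical, chosen only to illustrate the gap between the old default and the intended budget.

```python
# Hypothetical config values for illustration (not taken from the PR).
max_model_len = 4096           # prompt_length + response_length budget
response_length = 512          # intended per-response token budget
prompt_ids = [0] * 10          # a short 10-token prompt

# Old default: whatever context remains after the prompt.
old_default = max_model_len - len(prompt_ids)   # 4086

# Fixed default: configured response length, clamped to remaining context.
max_possible_tokens = max_model_len - len(prompt_ids)
new_default = min(response_length, max_possible_tokens)  # 512
```

With these numbers the old default allows roughly 8x the intended response length for a short prompt, which is exactly the over-generation described above.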

Solution

  • Default max_tokens to self.config.response_length (the configured response length)
  • Apply safety clamp: min(max_tokens, max_possible_tokens) to ensure we never exceed available context space
  • Honor explicit max_tokens from caller with safety clamping
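The three bullets above can be sketched as a small helper. This is a minimal sketch, not the repository code: the function name `resolve_max_tokens` is hypothetical, and `config` is assumed to expose `max_model_len` and `response_length` as named in the PR.

```python
from types import SimpleNamespace

def resolve_max_tokens(sampling_params: dict, prompt_ids: list, config) -> int:
    """Sketch of the max_tokens resolution described above (hypothetical name)."""
    # Tokens that can still fit in the model's context window.
    max_possible_tokens = config.max_model_len - len(prompt_ids)

    # Default to the configured response length when the caller gave none.
    max_tokens = sampling_params.get("max_tokens")
    if max_tokens is None:
        max_tokens = config.response_length

    # Safety clamp: never exceed the remaining context space.
    return min(max_tokens, max_possible_tokens)

# Demo with hypothetical values:
cfg = SimpleNamespace(max_model_len=4096, response_length=512)
short_prompt = [0] * 10
default_budget = resolve_max_tokens({}, short_prompt, cfg)            # 512
clamped_budget = resolve_max_tokens({"max_tokens": 8000}, short_prompt, cfg)  # 4086
```

An explicit caller-supplied `max_tokens` below the clamp passes through unchanged, which matches the "honor explicit max_tokens" bullet.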

Changes

  • verl/workers/rollout/vllm_rollout/vllm_async_server.py: Updated generate() method

… async server

Fixes verl-project#4568

Previously, when no max_tokens was provided in sampling_params, the default
was calculated as (max_model_len - len(prompt_ids)). For short prompts,
this could result in max_tokens being much larger than the intended
response_length, causing over-generation and latency issues.

Changes:
- Default max_tokens to config.response_length instead of remaining context
- Apply safety clamp: min(max_tokens, max_possible_tokens) to prevent
  exceeding available context space
- Update variable naming and comments for clarity

This ensures short prompts don't cause disproportionate over-generation
while still honoring explicit max_tokens when provided.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request correctly fixes a bug in the vLLM async server where short prompts could lead to over-generation due to an incorrect default max_tokens calculation. The new logic properly defaults to the configured response_length and applies a safety clamp to stay within the model's context window. The change is logical and effectively resolves the issue. I have one high-severity suggestion regarding a redundant assertion that should be removed to improve code clarity and maintainability.

Comment on lines +482 to 484
```python
assert max_tokens <= max_possible_tokens, (
    f"max_tokens {max_tokens} exceeds available context space {max_possible_tokens}"
)
```
Contributor


Severity: high

This assertion will always pass because the preceding line max_tokens = min(max_tokens, max_possible_tokens) ensures that max_tokens is never greater than max_possible_tokens. While this doesn't cause a runtime error, it makes the assertion redundant and can be misleading for future developers who might rely on it as a safeguard. A non-functional check gives a false sense of security and could hide bugs if the clamping logic were to be changed incorrectly in the future. Since the new logic intentionally clamps the value rather than asserting against invalid user input, this assertion should be removed to improve code clarity and maintainability.
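The redundancy is easy to see in isolation. The toy function below (a reproduction for illustration, not the repository code) shows that once `min()` has run, the assertion can never fire:

```python
def clamp(max_tokens: int, max_possible_tokens: int) -> int:
    """Toy reproduction of the clamp-then-assert pattern under review."""
    max_tokens = min(max_tokens, max_possible_tokens)
    # This assertion can never fail: min() already guarantees the bound,
    # so it guards nothing and would silently vanish under `python -O`.
    assert max_tokens <= max_possible_tokens
    return max_tokens

oversized = clamp(10_000, 4086)   # clamped to 4086; assertion trivially passes
in_budget = clamp(100, 4086)      # 100, unchanged
```

Dropping the assertion leaves behavior identical while making the clamp the single, obvious source of the invariant.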

@wuxibin89 wuxibin89 changed the title fix(rollout): use configured response_length as default max_tokens in vLLM async server [rollout] fix: use configured response_length as default max_tokens in vLLM async server Dec 29, 2025
@wuxibin89 wuxibin89 merged commit 04a8490 into verl-project:main Dec 29, 2025
64 of 68 checks passed
boren-ms pushed a commit to boren-ms/verl that referenced this pull request Dec 30, 2025
…n vLLM async server (verl-project#4703)

## Summary

Fixes verl-project#4568

This PR fixes the `max_tokens` calculation bug in the vLLM async server
where short prompts could cause over-generation.

### Problem
When no `max_tokens` is provided in `sampling_params`, the default was
calculated as:
```python
response_length = self.config.max_model_len - len(prompt_ids)
```

Since `max_model_len = prompt_length + response_length`, for short
prompts this resulted in `max_tokens` being much larger than the
intended `response_length`, causing:
- Unnecessary over-generation
- Latency spikes
- Wasted compute resources

### Solution
- Default `max_tokens` to `self.config.response_length` (the configured
response length)
- Apply safety clamp: `min(max_tokens, max_possible_tokens)` to ensure
we never exceed available context space
- Honor explicit `max_tokens` from caller with safety clamping

### Changes
- `verl/workers/rollout/vllm_rollout/vllm_async_server.py`: Updated
`generate()` method

## Test plan
- [x] Syntax validation passes
- [ ] Manual testing with short prompts to verify reduced max_tokens
- [ ] Integration tests with agent loop

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: yurekami <yurekami@users.noreply.github.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
jsfanfanfan pushed a commit to meituan-search/verl that referenced this pull request Jan 9, 2026
…n vLLM async server (verl-project#4703)

vyomakesh0728 added a commit to vyomakesh0728/verl that referenced this pull request Jan 22, 2026
…n vLLM async server (verl-project#4703)

sophiayyya pushed a commit to sophiayyya/verl that referenced this pull request Jan 25, 2026
…n vLLM async server (verl-project#4703)



Development

Successfully merging this pull request may close these issues.

Async rollout (AgentLoop + vLLM async server): max_tokens per generation is computed incorrectly, causing over-generation and slowdowns

2 participants