[rollout] fix: use configured response_length as default max_tokens in vLLM async server#4703
Conversation
… async server

Fixes verl-project#4568

Previously, when no `max_tokens` was provided in `sampling_params`, the default was calculated as `max_model_len - len(prompt_ids)`. For short prompts, this could result in `max_tokens` being much larger than the intended `response_length`, causing over-generation and latency issues.

Changes:
- Default `max_tokens` to `config.response_length` instead of the remaining context
- Apply a safety clamp, `min(max_tokens, max_possible_tokens)`, to prevent exceeding the available context space
- Update variable naming and comments for clarity

This ensures short prompts don't cause disproportionate over-generation while still honoring an explicit `max_tokens` when provided.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Code Review
This pull request fixes a bug in the vLLM async server where short prompts could lead to over-generation due to an incorrect default `max_tokens` calculation. The new logic defaults to the configured `response_length` and applies a safety clamp to stay within the model's context window, which resolves the issue. I have one high-severity suggestion: a redundant assertion should be removed to improve code clarity and maintainability.
assert max_tokens <= max_possible_tokens, (
    f"max_tokens {max_tokens} exceeds available context space {max_possible_tokens}"
)
This assertion will always pass because the preceding line `max_tokens = min(max_tokens, max_possible_tokens)` ensures that `max_tokens` is never greater than `max_possible_tokens`. While this doesn't cause a runtime error, it makes the assertion redundant and can be misleading for future developers who might rely on it as a safeguard. A non-functional check gives a false sense of security and could hide bugs if the clamping logic were to be changed incorrectly in the future. Since the new logic intentionally clamps the value rather than asserting against invalid user input, this assertion should be removed to improve code clarity and maintainability.
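The redundancy can be seen in a minimal standalone sketch (the helper and its names follow the diff above for illustration; this is not the actual `vllm_async_server.py` code):

```python
# Sketch of the clamping logic under review; clamp_max_tokens and its
# arguments are illustrative, assumed from the diff above.

def clamp_max_tokens(requested: int, max_possible_tokens: int) -> int:
    # The clamp guarantees the result never exceeds max_possible_tokens...
    max_tokens = min(requested, max_possible_tokens)
    # ...so this assertion can never fire: min(a, b) <= b by definition.
    assert max_tokens <= max_possible_tokens
    return max_tokens

print(clamp_max_tokens(4096, 1024))  # clamped to 1024
```

Because the assertion is unreachable in the failing direction, dropping it loses nothing; the clamp alone already encodes the invariant.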
…n vLLM async server (verl-project#4703)

## Summary

Fixes verl-project#4568

This PR fixes the `max_tokens` calculation bug in the vLLM async server where short prompts could cause over-generation.

### Problem

When no `max_tokens` is provided in `sampling_params`, the default was calculated as:

```python
response_length = self.config.max_model_len - len(prompt_ids)
```

Since `max_model_len = prompt_length + response_length`, for short prompts this resulted in `max_tokens` being much larger than the intended `response_length`, causing:

- Unnecessary over-generation
- Latency spikes
- Wasted compute resources

### Solution

- Default `max_tokens` to `self.config.response_length` (the configured response length)
- Apply a safety clamp, `min(max_tokens, max_possible_tokens)`, to ensure we never exceed the available context space
- Honor explicit `max_tokens` from the caller, with safety clamping

### Changes

- `verl/workers/rollout/vllm_rollout/vllm_async_server.py`: Updated the `generate()` method

## Test plan

- [x] Syntax validation passes
- [ ] Manual testing with short prompts to verify reduced max_tokens
- [ ] Integration tests with agent loop

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: yurekami <yurekami@users.noreply.github.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>