
[rollout] fix: use configured response_length as default max_tokens in vLLM async server#4703

Merged
wuxibin89 merged 1 commit into verl-project:main from yurekami:fix/vllm-async-max-tokens-default
Dec 29, 2025

Conversation

@yurekami
Contributor

@yurekami yurekami commented Dec 28, 2025

Summary

Fixes #4568

This PR fixes the max_tokens calculation bug in the vLLM async server where short prompts could cause over-generation.

Problem

When no max_tokens is provided in sampling_params, the default was calculated as:

```python
response_length = self.config.max_model_len - len(prompt_ids)
```

Since max_model_len = prompt_length + response_length, for short prompts this resulted in max_tokens being much larger than the intended response_length, causing:

  • Unnecessary over-generation
  • Latency spikes
  • Wasted compute resources
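To put rough numbers on the over-generation: the config values below are hypothetical, chosen only to illustrate the gap between the old default and the intended budget.

```python
# Hypothetical config values for illustration (not taken from the PR).
max_model_len = 4096           # prompt_length + response_length budget
response_length = 512          # intended per-response token budget
prompt_ids = [0] * 10          # a short 10-token prompt

# Old default: whatever context remains after the prompt.
old_default = max_model_len - len(prompt_ids)   # 4086

# Fixed default: configured response length, clamped to remaining context.
max_possible_tokens = max_model_len - len(prompt_ids)
new_default = min(response_length, max_possible_tokens)  # 512
```

With these numbers the old default allows roughly 8x the intended response length for a short prompt, which is exactly the over-generation described above.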

Solution

  • Default max_tokens to self.config.response_length (the configured response length)
  • Apply safety clamp: min(max_tokens, max_possible_tokens) to ensure we never exceed available context space
  • Honor explicit max_tokens from caller with safety clamping
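The three bullets above can be sketched as a small helper. This is a minimal sketch, not the repository code: the function name `resolve_max_tokens` is hypothetical, and `config` is assumed to expose `max_model_len` and `response_length` as named in the PR.

```python
from types import SimpleNamespace

def resolve_max_tokens(sampling_params: dict, prompt_ids: list, config) -> int:
    """Sketch of the max_tokens resolution described above (hypothetical name)."""
    # Tokens that can still fit in the model's context window.
    max_possible_tokens = config.max_model_len - len(prompt_ids)

    # Default to the configured response length when the caller gave none.
    max_tokens = sampling_params.get("max_tokens")
    if max_tokens is None:
        max_tokens = config.response_length

    # Safety clamp: never exceed the remaining context space.
    return min(max_tokens, max_possible_tokens)

# Demo with hypothetical values:
cfg = SimpleNamespace(max_model_len=4096, response_length=512)
short_prompt = [0] * 10
default_budget = resolve_max_tokens({}, short_prompt, cfg)            # 512
clamped_budget = resolve_max_tokens({"max_tokens": 8000}, short_prompt, cfg)  # 4086
```

An explicit caller-supplied `max_tokens` below the clamp passes through unchanged, which matches the "honor explicit max_tokens" bullet.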

Changes

  • verl/workers/rollout/vllm_rollout/vllm_async_server.py: Updated generate() method

… async server

Fixes verl-project#4568

Previously, when no max_tokens was provided in sampling_params, the default
was calculated as (max_model_len - len(prompt_ids)). For short prompts,
this could result in max_tokens being much larger than the intended
response_length, causing over-generation and latency issues.

Changes:
- Default max_tokens to config.response_length instead of remaining context
- Apply safety clamp: min(max_tokens, max_possible_tokens) to prevent
  exceeding available context space
- Update variable naming and comments for clarity

This ensures short prompts don't cause disproportionate over-generation
while still honoring explicit max_tokens when provided.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request correctly fixes a bug in the vLLM async server where short prompts could lead to over-generation due to an incorrect default max_tokens calculation. The new logic properly defaults to the configured response_length and applies a safety clamp to stay within the model's context window. The change is logical and effectively resolves the issue. I have one high-severity suggestion regarding a redundant assertion that should be removed to improve code clarity and maintainability.

Comment on lines +482 to 484
```python
assert max_tokens <= max_possible_tokens, (
    f"max_tokens {max_tokens} exceeds available context space {max_possible_tokens}"
)
```
Contributor


Severity: high

This assertion will always pass because the preceding line max_tokens = min(max_tokens, max_possible_tokens) ensures that max_tokens is never greater than max_possible_tokens. While this doesn't cause a runtime error, it makes the assertion redundant and can be misleading for future developers who might rely on it as a safeguard. A non-functional check gives a false sense of security and could hide bugs if the clamping logic were to be changed incorrectly in the future. Since the new logic intentionally clamps the value rather than asserting against invalid user input, this assertion should be removed to improve code clarity and maintainability.
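The redundancy is easy to see in isolation. The toy function below (a reproduction for illustration, not the repository code) shows that once `min()` has run, the assertion can never fire:

```python
def clamp(max_tokens: int, max_possible_tokens: int) -> int:
    """Toy reproduction of the clamp-then-assert pattern under review."""
    max_tokens = min(max_tokens, max_possible_tokens)
    # This assertion can never fail: min() already guarantees the bound,
    # so it guards nothing and would silently vanish under `python -O`.
    assert max_tokens <= max_possible_tokens
    return max_tokens

oversized = clamp(10_000, 4086)   # clamped to 4086; assertion trivially passes
in_budget = clamp(100, 4086)      # 100, unchanged
```

Dropping the assertion leaves behavior identical while making the clamp the single, obvious source of the invariant.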

@wuxibin89 wuxibin89 changed the title fix(rollout): use configured response_length as default max_tokens in vLLM async server [rollout] fix: use configured response_length as default max_tokens in vLLM async server Dec 29, 2025
@wuxibin89 wuxibin89 merged commit 04a8490 into verl-project:main Dec 29, 2025
64 of 68 checks passed
boren-ms pushed a commit to boren-ms/verl that referenced this pull request Dec 30, 2025
…n vLLM async server (verl-project#4703)

## Summary

Fixes verl-project#4568

This PR fixes the `max_tokens` calculation bug in the vLLM async server
where short prompts could cause over-generation.

### Problem
When no `max_tokens` is provided in `sampling_params`, the default was
calculated as:
```python
response_length = self.config.max_model_len - len(prompt_ids)
```

Since `max_model_len = prompt_length + response_length`, for short
prompts this resulted in `max_tokens` being much larger than the
intended `response_length`, causing:
- Unnecessary over-generation
- Latency spikes
- Wasted compute resources

### Solution
- Default `max_tokens` to `self.config.response_length` (the configured
response length)
- Apply safety clamp: `min(max_tokens, max_possible_tokens)` to ensure
we never exceed available context space
- Honor explicit `max_tokens` from caller with safety clamping

### Changes
- `verl/workers/rollout/vllm_rollout/vllm_async_server.py`: Updated
`generate()` method

## Test plan
- [x] Syntax validation passes
- [ ] Manual testing with short prompts to verify reduced max_tokens
- [ ] Integration tests with agent loop

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: yurekami <yurekami@users.noreply.github.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
jsfanfanfan pushed a commit to meituan-search/verl that referenced this pull request Jan 9, 2026
…n vLLM async server (verl-project#4703)

vyomakesh0728 added a commit to vyomakesh0728/verl that referenced this pull request Jan 22, 2026
…n vLLM async server (verl-project#4703)

sophiayyya pushed a commit to sophiayyya/verl that referenced this pull request Jan 25, 2026
…n vLLM async server (verl-project#4703)



Development

Successfully merging this pull request may close these issues.

Async rollout (AgentLoop + vLLM async server): max_tokens per generation is computed incorrectly, causing over-generation and slowdowns

2 participants