
[vllm, fully_async] fix: clamp max_tokens to response_length instead of max_model_len - prompt_len in async vLLM rollout #5505

Closed

Silas-11 wants to merge 2 commits into verl-project:main from Silas-11:release

Conversation

@Silas-11
Contributor

@Silas-11 Silas-11 commented Mar 5, 2026

What does this PR do?

Fixes a silent but severe bug in AsyncvLLMServer where max_tokens was incorrectly set to max_model_len - len(prompt_ids) (the entire remaining context window) instead of the configured response_length.

This caused every request to be scheduled as if it might generate up to max_model_len - len(prompt_ids) tokens. Under concurrent load, the KV cache was exhausted almost immediately, triggering a preemption storm: preempted requests had num_computed_tokens reset to 0, forcing a full re-prefill that consumed even more KV blocks and triggered further preemptions, completely collapsing throughput in Full Async mode. The failure was silent (no errors or warnings), making it extremely difficult to diagnose.

The bug is especially severe when max_model_len is set manually to a value much larger than prompt_length + response_length. This is common for multimodal models like Qwen3-VL: since the current multimodal data filter estimates sequence length from text tokens only, the actual prompt_ids after vision token insertion can significantly exceed the configured prompt_length, requiring a larger max_model_len to avoid context overflow at inference time.

Fixes #5504.


Checklist Before Starting


Test

Validated on a Qwen3-VL rollout job with Full Async vLLM and the following config:

actor_rollout_ref:
  rollout:
    max_model_len: 32768
    response_length: 2048
    prompt_length: 1024

Before fix: continuous preemption storm in EngineCore logs; KV cache utilization pinned at ~99%; effective throughput near zero.

(EngineCore_DP0) Preempt the request
prompt_tokens=125, num_computed_tokens=159, max_tokens=32643, num_output_tokens=35

(EngineCore_DP0) Preempt the request
prompt_tokens=163, num_computed_tokens=235, max_tokens=32605, num_output_tokens=73

After fix: max_tokens correctly clamped to ≤ response_length; preemption rate drops significantly; KV cache utilization returns to normal; throughput recovers as expected.


API and Usage Example

No API changes. Behavior change: max_tokens per request now correctly defaults to response_length instead of max_model_len - prompt_len.

Explicitly passed max_tokens / max_new_tokens in sampling_params are still honored with priority.
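For illustration, a minimal sketch of the priority behavior (plain-dict sampling_params; the field names follow the OpenAI / SGLang conventions handled by the fix, and everything else about the request is omitted):

# Explicit cap: used as-is, then clamped to the remaining context window.
sampling_params = {"temperature": 1.0, "max_tokens": 512}

# No explicit cap: the server now falls back to
# response_length + prompt_length - len(prompt_ids)
# instead of max_model_len - len(prompt_ids).
sampling_params = {"temperature": 1.0}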


Design & Code Changes

Root cause:

# ❌ Before: sets max_tokens to the entire remaining context window
max_tokens = self.config.max_model_len - len(prompt_ids)

Fix (aligned with the existing correct implementation in vllm_async_server.py):

# Step 1: compute hard upper bound
max_possible_tokens = self.config.max_model_len - len(prompt_ids)
if max_possible_tokens < 0:
    raise ValueError(
        f"Prompt length ({len(prompt_ids)}) exceeds max context length "
        f"({self.config.max_model_len})."
    )

# Step 2: honor explicitly passed values (support both OpenAI and SGLang field names)
if "max_tokens" in sampling_params:
    max_tokens = sampling_params.pop("max_tokens")
elif "max_new_tokens" in sampling_params:
    max_tokens = sampling_params.pop("max_new_tokens")
else:
    # Step 3: default to response_length, not the entire remaining window
    max_tokens = self.config.response_length + self.config.prompt_length - len(prompt_ids)

# Step 4: clamp to valid range [0, max_possible_tokens]
max_tokens = max(0, min(max_tokens, max_possible_tokens))
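
A self-contained sketch of the same resolution logic, roughly what the unit test mentioned in the checklist below could exercise. The standalone resolve_max_tokens helper and the example values (beyond the 32768/2048/1024 config and the 125-token prompt from the logs) are illustrative only, not part of the actual change:

def resolve_max_tokens(sampling_params: dict, prompt_len: int,
                       max_model_len: int, prompt_length: int,
                       response_length: int) -> int:
    """Illustrative standalone version of the resolution logic above."""
    max_possible_tokens = max_model_len - prompt_len
    if max_possible_tokens < 0:
        raise ValueError(
            f"Prompt length ({prompt_len}) exceeds max context length ({max_model_len})."
        )
    if "max_tokens" in sampling_params:
        max_tokens = sampling_params.pop("max_tokens")
    elif "max_new_tokens" in sampling_params:
        max_tokens = sampling_params.pop("max_new_tokens")
    else:
        max_tokens = response_length + prompt_length - prompt_len
    return max(0, min(max_tokens, max_possible_tokens))

# With the config from the Test section (32768 / 2048 / 1024):
assert resolve_max_tokens({}, prompt_len=125, max_model_len=32768,
                          prompt_length=1024, response_length=2048) == 2947  # not 32643
assert resolve_max_tokens({"max_tokens": 256}, prompt_len=125, max_model_len=32768,
                          prompt_length=1024, response_length=2048) == 256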

Files changed:

  • verl/experimental/fully_async_policy/vllm_rollout/vllm_async_server.py: replace the single-line max_tokens assignment with the above logic

Checklist Before Submitting

  • Read the Contribute Guide.
  • Apply pre-commit checks: pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
  • Add / Update the documentation.
  • Add unit or end-to-end test(s) to the CI workflow to cover all the code. If not feasible, explain why: the bug is triggered by specific runtime conditions (concurrent requests + large max_model_len gap); a unit test covering the max_tokens clamping logic in the request construction path will be added.
  • Once your PR is ready for CI, send a message in the ci-request channel.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

The pull request effectively addresses a critical bug where max_tokens was incorrectly calculated, leading to KV cache exhaustion and preemption storms. The new logic correctly prioritizes explicit max_tokens or max_new_tokens from sampling_params and includes robust clamping to max_possible_tokens. However, there appears to be a discrepancy between the stated goal of defaulting max_tokens to response_length and its current implementation, which could lead to unexpected token generation lengths under certain conditions.

    max_tokens = sampling_params.pop("max_new_tokens")
else:
    # Default to a calculation that considers configured lengths
    max_tokens = self.config.response_length + self.config.prompt_length - len(prompt_ids)

Severity: high

The PR description states that max_tokens should "default to response_length, not the entire remaining window" (Design & Code Changes, Step 3). However, the current calculation self.config.response_length + self.config.prompt_length - len(prompt_ids) means that if len(prompt_ids) is less than self.config.prompt_length, the default max_tokens will be greater than self.config.response_length. This deviates from the stated objective and could still lead to higher-than-intended KV cache usage if response_length is meant to be a strict upper bound for generated tokens when not explicitly provided.

max_tokens = self.config.response_length

@wuxibin89
Collaborator

It's intentional that max_tokens = self.config.response_length + self.config.prompt_length - len(prompt_ids): we want the prompt and response to share a fixed budget, so each sample's max response length can vary according to its prompt length.
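For concreteness, a small illustrative calculation using the config from the Test section (the 125- and 163-token prompts come from the preemption logs above; 1024 and 3000 are hypothetical):

# Fixed shared budget per sample: prompt_length + response_length = 1024 + 2048 = 3072 tokens.
budget = 1024 + 2048
for prompt_len in (125, 163, 1024, 3000):
    print(prompt_len, max(0, budget - prompt_len))
# 125  -> 2947   (short prompt: the response may use the unused prompt budget)
# 163  -> 2909
# 1024 -> 2048   (prompt exactly at prompt_length: response capped at response_length)
# 3000 -> 72     (prompt grown past prompt_length, e.g. after vision-token insertion: response shrinks)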

@wuxibin89
Collaborator

one-step/fully-async is under refactor in #5487

@wuxibin89 wuxibin89 closed this Mar 6, 2026
@wuxibin89 wuxibin89 reopened this Mar 6, 2026
@wuxibin89
Collaborator

Under concurrent load, KV cache was exhausted almost immediately, triggering a preemption storm

Are there any Prometheus metrics we can use to monitor server preemption?

@wuxibin89 wuxibin89 closed this Mar 6, 2026
