[vllm, fully_async] fix: clamp max_tokens to response_length instead of max_model_len - prompt_len in async vLLM rollout#5505
Conversation
Code Review
The pull request effectively addresses a critical bug where `max_tokens` was incorrectly calculated, leading to KV cache exhaustion and preemption storms. The new logic correctly prioritizes explicit `max_tokens` or `max_new_tokens` from `sampling_params` and includes robust clamping to `max_possible_tokens`. However, there appears to be a discrepancy between the stated goal of defaulting `max_tokens` to `response_length` and its current implementation, which could lead to unexpected token generation lengths under certain conditions.
```python
    max_tokens = sampling_params.pop("max_new_tokens")
else:
    # Default to a calculation that considers configured lengths
    max_tokens = self.config.response_length + self.config.prompt_length - len(prompt_ids)
```
The PR description states that `max_tokens` should "default to response_length, not the entire remaining window" (Design & Code Changes, Step 3). However, the current calculation `self.config.response_length + self.config.prompt_length - len(prompt_ids)` means that if `len(prompt_ids)` is less than `self.config.prompt_length`, the default `max_tokens` will be greater than `self.config.response_length`. This deviates from the stated objective and could still lead to higher-than-intended KV cache usage if `response_length` is meant to be a strict upper bound on generated tokens when not explicitly provided; a sketch of one possible clamping follows the suggestion below.
max_tokens = self.config.response_length…t overflow deduction
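The suggested replacement above is truncated in this view. Read together with the comment, it points at defaulting to `response_length` with a deduction for any prompt overflow. A minimal sketch of that idea, assuming the names used in the diff (`self.config.response_length`, `self.config.prompt_length`, `prompt_ids`); the overflow deduction is an interpretation, not the reviewer's exact code:

```python
# Sketch only, not the reviewer's verbatim suggestion: cap generation at response_length,
# reduced by any prompt tokens beyond the configured prompt_length, so that the full
# sequence still fits within prompt_length + response_length.
prompt_overflow = max(0, len(prompt_ids) - self.config.prompt_length)
max_tokens = max(1, self.config.response_length - prompt_overflow)
```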
one-step/fully-async is under refactor (#5487)
Are there any Prometheus metrics we can use to monitor server preemption?

What does this PR do?
Fixes a silent but severe bug in `AsyncvLLMServer` where `max_tokens` was incorrectly set to `max_model_len - len(prompt_ids)` (the entire remaining context window) instead of the configured `response_length`.

This caused every request to expect to generate up to `max_model_len - len(prompt_ids)` tokens. Under concurrent load, KV cache was exhausted almost immediately, triggering a preemption storm: preempted requests had `num_computed_tokens` reset to 0, forcing a full re-prefill that consumed even more KV blocks and triggered further preemptions, completely collapsing throughput in Full Async mode. The failure was silent (no errors or warnings), making it extremely difficult to diagnose.

The bug is especially severe when `max_model_len` is set manually to a value much larger than `prompt_length + response_length`. This is common for multimodal models like Qwen3-VL: since the current multimodal data filter estimates sequence length from text tokens only, the actual `prompt_ids` after vision token insertion can significantly exceed the configured `prompt_length`, requiring a larger `max_model_len` to avoid context overflow at inference time.

Fixes #5504.
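To make the failure mode concrete, a back-of-the-envelope illustration with hypothetical numbers (none of these values come from the PR):

```python
# Hypothetical configuration, for illustration only.
max_model_len = 32768
prompt_length = 4096
response_length = 2048
prompt_ids_len = 3000  # tokenized prompt for one request

# Buggy behavior: every request budgets the entire remaining context window.
buggy_max_tokens = max_model_len - prompt_ids_len  # 29768 tokens reserved per request
# Intended behavior: requests are bounded by the configured response budget.
intended_max_tokens = response_length              # 2048 tokens per request
```

Under concurrent load the scheduler plans KV blocks for the inflated budget, which is what exhausts the cache and starts the preempt / re-prefill loop described above.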
Checklist Before Starting
`max_tokens` `response_length` async vllm

Test
Validated on a Qwen3-VL rollout job with Full Async vLLM and the following config:
Before fix: continuous preemption storm in EngineCore logs; KV cache utilization pinned at ~99%; effective throughput near zero.
After fix: `max_tokens` correctly clamped to ≤ `response_length`; preemption rate drops significantly; KV cache utilization returns to normal; throughput recovers as expected.

API and Usage Example
No API changes. Behavior change: `max_tokens` per request now correctly defaults to `response_length` instead of `max_model_len - prompt_len`. Explicitly passed `max_tokens` / `max_new_tokens` in `sampling_params` are still honored with priority (a hypothetical caller-side example follows).
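A hypothetical caller-side view of the behavior change (the dict-style `sampling_params` below is illustrative; the actual request construction in verl may differ):

```python
# Explicit cap: honored with priority, then clamped so it fits the context window.
sampling_params = {"temperature": 1.0, "max_tokens": 512}

# No explicit cap: the server now defaults the budget to the configured response_length
# instead of max_model_len - len(prompt_ids).
sampling_params = {"temperature": 1.0}
```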
Design & Code Changes
Root cause: `max_tokens` was computed as `max_model_len - len(prompt_ids)`, i.e. the entire remaining context window, regardless of the configured `response_length`.
Fix (aligned with the existing correct implementation in `vllm_async_server.py`): prefer an explicit `max_tokens` / `max_new_tokens` from `sampling_params`, otherwise default from the configured lengths, and clamp the result so it fits within the model's context window (see the sketch after the file list below).

Files changed:
`verl/experimental/fully_async_policy/vllm_rollout/vllm_async_server.py`: replace the single-line `max_tokens` assignment with the above logic
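A minimal sketch of the resulting `max_tokens` selection, stitched together from the diff above and the review summary; `self.max_model_len` and the final clamping line are assumptions, not a verbatim copy of the patch:

```python
# Sketch of the max_tokens logic described in this PR (not a verbatim copy).
if "max_tokens" in sampling_params:
    max_tokens = sampling_params.pop("max_tokens")
elif "max_new_tokens" in sampling_params:
    max_tokens = sampling_params.pop("max_new_tokens")
else:
    # Default from the configured lengths instead of the whole remaining window.
    max_tokens = self.config.response_length + self.config.prompt_length - len(prompt_ids)

# Clamp so a request can never ask for more tokens than actually fit in the context.
max_possible_tokens = self.max_model_len - len(prompt_ids)
max_tokens = max(1, min(max_tokens, max_possible_tokens))
```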
Checklist Before Submitting

- `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- A unit test covering the `max_tokens` clamping logic in the request construction path will be added (existing tests do not cover the `max_model_len` gap).
- Send a message in the `ci-request` channel once the PR is ready for CI.