
[vllm, fully_async] fix: clamp max_tokens to response_length instead of max_model_len - prompt_len in async vLLM rollout #5505

Closed

Silas-11 wants to merge 2 commits into verl-project:main from Silas-11:release

Conversation

@Silas-11
Contributor

@Silas-11 Silas-11 commented Mar 5, 2026

What does this PR do?

Fixes a silent but severe bug in AsyncvLLMServer where max_tokens was incorrectly set to max_model_len - len(prompt_ids) (the entire remaining context window) instead of the configured response_length.

This caused every request to be scheduled as if it might generate up to max_model_len - len(prompt_ids) tokens. Under concurrent load, the KV cache was exhausted almost immediately, triggering a preemption storm: preempted requests had num_computed_tokens reset to 0, forcing a full re-prefill that consumed even more KV blocks and triggered further preemptions, completely collapsing throughput in Full Async mode. The failure was silent (no errors or warnings), making it extremely difficult to diagnose.

The bug is especially severe when max_model_len is set manually to a value much larger than prompt_length + response_length. This is common for multimodal models like Qwen3-VL: since the current multimodal data filter estimates sequence length from text tokens only, the actual prompt_ids after vision token insertion can significantly exceed the configured prompt_length, requiring a larger max_model_len to avoid context overflow at inference time.

Fixes #5504.


Checklist Before Starting


Test

Validated on a Qwen3-VL rollout job with Full Async vLLM and the following config:

actor_rollout_ref:
  rollout:
    max_model_len: 32768
    response_length: 2048
    prompt_length: 1024

Before fix: continuous preemption storm in EngineCore logs; KV cache utilization pinned at ~99%; effective throughput near zero.

(EngineCore_DP0) Preempt the request
prompt_tokens=125, num_computed_tokens=159, max_tokens=32643, num_output_tokens=35

(EngineCore_DP0) Preempt the request
prompt_tokens=163, num_computed_tokens=235, max_tokens=32605, num_output_tokens=73

After fix: max_tokens correctly clamped to ≤ response_length; preemption rate drops significantly; KV cache utilization returns to normal; throughput recovers as expected.


API and Usage Example

No API changes. Behavior change: max_tokens per request now correctly defaults to response_length instead of max_model_len - prompt_len.

Explicitly passed max_tokens / max_new_tokens in sampling_params are still honored with priority.
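For illustration, a minimal sketch of the priority behavior (plain-dict sampling_params; the field names follow the OpenAI / SGLang conventions handled by the fix, and everything else about the request is omitted):

# Explicit cap: used as-is, then clamped to the remaining context window.
sampling_params = {"temperature": 1.0, "max_tokens": 512}

# No explicit cap: the server now falls back to
# response_length + prompt_length - len(prompt_ids)
# instead of max_model_len - len(prompt_ids).
sampling_params = {"temperature": 1.0}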


Design & Code Changes

Root cause:

# ❌ Before: sets max_tokens to the entire remaining context window
max_tokens = self.config.max_model_len - len(prompt_ids)

Fix (aligned with the existing correct implementation in vllm_async_server.py):

# Step 1: compute hard upper bound
max_possible_tokens = self.config.max_model_len - len(prompt_ids)
if max_possible_tokens < 0:
    raise ValueError(
        f"Prompt length ({len(prompt_ids)}) exceeds max context length "
        f"({self.config.max_model_len})."
    )

# Step 2: honor explicitly passed values (support both OpenAI and SGLang field names)
if "max_tokens" in sampling_params:
    max_tokens = sampling_params.pop("max_tokens")
elif "max_new_tokens" in sampling_params:
    max_tokens = sampling_params.pop("max_new_tokens")
else:
    # Step 3: default to response_length, not the entire remaining window
    max_tokens = self.config.response_length + self.config.prompt_length - len(prompt_ids)

# Step 4: clamp to valid range [0, max_possible_tokens]
max_tokens = max(0, min(max_tokens, max_possible_tokens))
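
A self-contained sketch of the same resolution logic, roughly what the unit test mentioned in the checklist below could exercise. The standalone resolve_max_tokens helper and the example values (beyond the 32768/2048/1024 config and the 125-token prompt from the logs) are illustrative only, not part of the actual change:

def resolve_max_tokens(sampling_params: dict, prompt_len: int,
                       max_model_len: int, prompt_length: int,
                       response_length: int) -> int:
    """Illustrative standalone version of the resolution logic above."""
    max_possible_tokens = max_model_len - prompt_len
    if max_possible_tokens < 0:
        raise ValueError(
            f"Prompt length ({prompt_len}) exceeds max context length ({max_model_len})."
        )
    if "max_tokens" in sampling_params:
        max_tokens = sampling_params.pop("max_tokens")
    elif "max_new_tokens" in sampling_params:
        max_tokens = sampling_params.pop("max_new_tokens")
    else:
        max_tokens = response_length + prompt_length - prompt_len
    return max(0, min(max_tokens, max_possible_tokens))

# With the config from the Test section (32768 / 2048 / 1024):
assert resolve_max_tokens({}, prompt_len=125, max_model_len=32768,
                          prompt_length=1024, response_length=2048) == 2947  # not 32643
assert resolve_max_tokens({"max_tokens": 256}, prompt_len=125, max_model_len=32768,
                          prompt_length=1024, response_length=2048) == 256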

Files changed:

  • verl/experimental/fully_async_policy/vllm_rollout/vllm_async_server.py: replace the single-line max_tokens assignment with the above logic

Checklist Before Submitting

  • Read the Contribute Guide.
  • Apply pre-commit checks: pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
  • Add / Update the documentation.
  • Add unit or end-to-end test(s) to the CI workflow to cover all the code. If not feasible, explain why: the bug is triggered by specific runtime conditions (concurrent requests + large max_model_len gap); a unit test covering the max_tokens clamping logic in the request construction path will be added.
  • Once your PR is ready for CI, send a message in the ci-request channel.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

The pull request effectively addresses a critical bug where max_tokens was incorrectly calculated, leading to KV cache exhaustion and preemption storms. The new logic correctly prioritizes explicit max_tokens or max_new_tokens from sampling_params and includes robust clamping to max_possible_tokens. However, there appears to be a discrepancy between the stated goal of defaulting max_tokens to response_length and its current implementation, which could lead to unexpected token generation lengths under certain conditions.

    max_tokens = sampling_params.pop("max_new_tokens")
else:
    # Default to a calculation that considers configured lengths
    max_tokens = self.config.response_length + self.config.prompt_length - len(prompt_ids)

Severity: high

The PR description states that max_tokens should "default to response_length, not the entire remaining window" (Design & Code Changes, Step 3). However, the current calculation self.config.response_length + self.config.prompt_length - len(prompt_ids) means that if len(prompt_ids) is less than self.config.prompt_length, the default max_tokens will be greater than self.config.response_length. This deviates from the stated objective and could still lead to higher-than-intended KV cache usage if response_length is meant to be a strict upper bound for generated tokens when not explicitly provided.

max_tokens = self.config.response_length

@wuxibin89
Collaborator

It's intentional that max_tokens = self.config.response_length + self.config.prompt_length - len(prompt_ids): we want the prompt and response to share a fixed budget, so each sample's max response length can vary according to its prompt length.
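For concreteness, a small illustrative calculation using the config from the Test section (the 125- and 163-token prompts come from the preemption logs above; 1024 and 3000 are hypothetical):

# Fixed shared budget per sample: prompt_length + response_length = 1024 + 2048 = 3072 tokens.
budget = 1024 + 2048
for prompt_len in (125, 163, 1024, 3000):
    print(prompt_len, max(0, budget - prompt_len))
# 125  -> 2947   (short prompt: the response may use the unused prompt budget)
# 163  -> 2909
# 1024 -> 2048   (prompt exactly at prompt_length: response capped at response_length)
# 3000 -> 72     (prompt grown past prompt_length, e.g. after vision-token insertion: response shrinks)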

@wuxibin89
Collaborator

one-step/fully-async is under refactor in #5487

@wuxibin89 wuxibin89 closed this Mar 6, 2026
@wuxibin89 wuxibin89 reopened this Mar 6, 2026
@wuxibin89
Collaborator

Under concurrent load, KV cache was exhausted almost immediately, triggering a preemption storm

Are there any Prometheus metrics we can use to monitor server preemption?

@wuxibin89 wuxibin89 closed this Mar 6, 2026
