
Fix(Scheduler): Reset num_cached_tokens on preemption to prevent accounting crash #36757

Closed

xueliangyang-oeuler wants to merge 1 commit into vllm-project:main from xueliangyang-oeuler:fix-#36755

Conversation

@xueliangyang-oeuler xueliangyang-oeuler commented Mar 11, 2026

Fix(Scheduler): Reset num_cached_tokens on preemption to prevent accounting crash (#36755)

Purpose

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing the test command.
  • The test results, such as pasting a before/after comparison or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request correctly addresses a potential crash during preemption by resetting num_cached_tokens. It also includes a fix for a potential dtype mismatch in the TRT-LLM FP8 MoE expert implementation. The changes appear correct and improve the robustness of the scheduler and model execution layers. I have added one comment suggesting a related improvement for consistency.

Note: Security Review did not run due to the size of the PR.
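The scheduler fix under review can be sketched as follows. This is a minimal illustration of the idea, not vLLM's actual scheduler code; the `Request` fields and `preempt` function are illustrative stand-ins:

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Illustrative stand-in for a scheduled request's accounting state."""
    prompt_len: int
    num_computed_tokens: int = 0
    num_cached_tokens: int = 0


def preempt(request: Request) -> None:
    """Reset per-request accounting when a request is preempted.

    On preemption the request's KV-cache blocks are freed, so a previously
    recorded prefix-cache hit count no longer reflects reality. If the stale
    num_cached_tokens survives and exceeds the fresh value computed on
    re-schedule, the delta reported to metrics can go negative, which is
    what triggers the Prometheus counter crash referenced in #36755.
    """
    request.num_computed_tokens = 0
    request.num_cached_tokens = 0
```

The key point is that both counters are cleared together, so the next scheduling pass starts from a consistent zero baseline.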

Comment on lines +179 to +181

```python
if e_score_correction_bias is not None:
    e_score_correction_bias = e_score_correction_bias.to(hidden_states.dtype)
```


Severity: high

This explicit type casting is a good safeguard against potential dtype mismatches. For consistency, it would be beneficial to apply the same logic to the _apply_per_tensor method in this file, which also uses e_score_correction_bias but currently lacks this explicit cast. This would improve robustness and prevent similar potential issues there.

For example, you could add this to _apply_per_tensor:

```python
if e_score_correction_bias is not None:
    e_score_correction_bias = e_score_correction_bias.to(hidden_states.dtype)
```

markmc added a commit to markmc/vllm that referenced this pull request Mar 11, 2026
… non-negative amounts"

Since `num_computed_tokens`, `num_cached_tokens`, and
`num_external_computed_tokens` accounting seems quite brittle currently -
with preemption reset bugs and P/D disaggregation accounting issues -
add a defensive check to detect and prevent instances of Prometheus
counter errors:

```
ValueError: Counters can only be incremented by non-negative amounts
```

The invariant check enforces:

```
prompt_len >= num_cached_tokens >= num_external_computed_tokens >= 0
```

with the additional nuance that when all tokens are cached, the scheduler
forces recomputation of the last token, so the check relaxes to:

```
num_external_computed_tokens <= num_cached_tokens + recomputed
```

When the invariant is violated, we log a warning once with diagnostic
details and discard the suspect cache metrics.

Obviously, the accounting should be fixed and made more robust and
future-proof, at which point we can remove this check (perhaps replacing
with a simple assertion).
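The defensive check described above could look roughly like this. The helper name, its parameters, and the warn-once mechanism are illustrative assumptions, not the actual vLLM implementation:

```python
import logging

logger = logging.getLogger(__name__)

# Warn only on the first violation, as described in the commit message.
_warned = False


def validate_cache_metrics(
    prompt_len: int,
    num_cached_tokens: int,
    num_external_computed_tokens: int,
    num_recomputed_tokens: int = 0,
) -> bool:
    """Return True if the token-accounting invariant holds.

    Enforces prompt_len >= num_cached_tokens and
    0 <= num_external_computed_tokens <= num_cached_tokens + recomputed,
    allowing for the scheduler forcing recomputation of the last token
    when every prompt token was a cache hit. On violation, log a warning
    once and signal the caller to discard the suspect metrics.
    """
    global _warned
    ok = (
        0 <= num_external_computed_tokens
        <= num_cached_tokens + num_recomputed_tokens
        and num_cached_tokens <= prompt_len
    )
    if not ok and not _warned:
        logger.warning(
            "Invalid cache metrics: prompt_len=%d cached=%d external=%d",
            prompt_len, num_cached_tokens, num_external_computed_tokens,
        )
        _warned = True
    return ok
```

A caller would skip the Prometheus counter update whenever this returns False, so no negative increment ever reaches the counter.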

Related to issues vllm-project#36533, vllm-project#36755 and PRs vllm-project#36638, vllm-project#36752, vllm-project#36757.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Signed-off-by: Mark McLoughlin <markmc@redhat.com>

…orrection_bias dtype conversion (vllm-project#36755)

Signed-off-by: xueliangyang-oeuler <yxl546827391@gmail.com>
@hmellor hmellor added the closed-as-slop Pull request determined to be low effort and agent generated label Mar 25, 2026
@hmellor hmellor closed this Mar 25, 2026
@github-project-automation github-project-automation Bot moved this to Done in NVIDIA Mar 25, 2026
@markmc markmc moved this from Backlog to In Review in Metrics & Tracing Apr 8, 2026
@markmc markmc moved this from In Review to Not planned in Metrics & Tracing Apr 8, 2026

Labels

closed-as-slop (Pull request determined to be low effort and agent generated), nvidia, v1

Projects

NVIDIA: Done
Metrics & Tracing: Not planned
