Fix(Scheduler): Reset num_cached_tokens on preemption to prevent acco…#36757
xueliangyang-oeuler wants to merge 1 commit into vllm-project:main
Conversation
Code Review
This pull request correctly addresses a potential crash during preemption by resetting num_cached_tokens. It also includes a fix for a potential dtype mismatch in the TRT-LLM FP8 MoE expert implementation. The changes appear correct and improve the robustness of the scheduler and model execution layers. I have added one comment suggesting a related improvement for consistency.
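As a rough sketch of the scheduler fix described above (the `Request` class and field names here are illustrative stand-ins, not vLLM's actual types), the key point is that preemption must reset cached-token accounting along with computed-token accounting:

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Minimal stand-in for the scheduler's per-request state (illustrative only)."""
    num_computed_tokens: int = 0
    num_cached_tokens: int = 0


def preempt(request: Request) -> None:
    # A preempted request restarts from scratch, so its token accounting
    # must be cleared. Leaving num_cached_tokens stale across preemption
    # is the kind of accounting bug this PR's title describes.
    request.num_computed_tokens = 0
    request.num_cached_tokens = 0
```

The reset values here are a simplification; the actual sentinel the scheduler uses may differ.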
Note: Security Review did not run due to the size of the PR.
```
if e_score_correction_bias is not None:
    e_score_correction_bias = e_score_correction_bias.to(hidden_states.dtype)
```
This explicit type casting is a good safeguard against potential dtype mismatches. For consistency, it would be beneficial to apply the same logic to the _apply_per_tensor method in this file, which also uses e_score_correction_bias but currently lacks this explicit cast. This would improve robustness and prevent similar potential issues there.
For example, you could add this to _apply_per_tensor:

```
if e_score_correction_bias is not None:
    e_score_correction_bias = e_score_correction_bias.to(hidden_states.dtype)
```

…non-negative amounts"

Since `num_computed_tokens`, `num_cached_tokens`, and `num_external_computed_tokens` accounting seems quite brittle currently - with preemption reset bugs and P/D disaggregation accounting issues - add a defensive check to detect and prevent instances of Prometheus counter errors:

```
ValueError: Counters can only be incremented by non-negative amounts
```

The invariant check enforces:

```
prompt_len >= num_cached_tokens >= num_external_computed_tokens >= 0
```

with the additional nuance that when all tokens are cached, the scheduler forces recomputation of the last token, so:

```
num_external_computed_tokens <= num_cached_tokens + recomputed
```

When the invariant is violated, we log a warning once with diagnostic details and discard the suspect cache metrics. Obviously, the accounting should be fixed and made more robust and future-proof, at which point we can remove this check (perhaps replacing it with a simple assertion).

Related to issues vllm-project#36533, vllm-project#36755 and PRs vllm-project#36638, vllm-project#36752, vllm-project#36757.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
…orrection_bias dtype conversion (vllm-project#36755)
Signed-off-by: xueliangyang-oeuler <yxl546827391@gmail.com>
…unting crash (#36755)
Purpose
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.