Merged
6 changes: 3 additions & 3 deletions vllm_ascend/core/recompute_scheduler.py
```diff
@@ -235,10 +235,10 @@ def schedule(self) -> SchedulerOutput:
             if preempted_encoder_inputs:
                 # Restore encoder compute budget if the preempted
                 # request had encoder inputs scheduled in this step.
-                num_tokens_to_restore = sum(
-                    preempted_req.get_num_encoder_tokens(i)
+                num_embeds_to_restore = sum(
+                    preempted_req.get_num_encoder_embeds(i)
                     for i in preempted_encoder_inputs)
-                encoder_compute_budget += num_tokens_to_restore
+                encoder_compute_budget += num_embeds_to_restore
```
Comment on lines +238 to +241
Contributor


critical

There appears to be a unit mismatch in the accounting of encoder_compute_budget. This change restores the budget using the number of embeddings (get_num_encoder_embeds), which implies the budget is tracked in units of embeddings.

However, encoder_compute_budget is initialized with self.max_num_encoder_input_tokens on line 116. This same configuration value is also used as num_encoder_tokens when calling self.kv_cache_manager.allocate_slots on lines 451-458. This suggests the budget is still measured in tokens.

If one embedding does not correspond to exactly one token, this will lead to incorrect budget calculations, potentially causing scheduling failures or overallocation.

For example:

  • If encoder_compute_budget is in tokens, then restoring it with the number of embeddings is incorrect. It should be restored with the number of tokens corresponding to the preempted embeddings.
  • If encoder_compute_budget is meant to be in embeddings, then the configuration max_num_encoder_input_tokens is misnamed and its usage for num_encoder_tokens in allocate_slots is likely incorrect, as that method probably expects a token count for allocating cross-attention KV cache.

This inconsistency could lead to critical issues. Please verify the units used for encoder_compute_budget throughout the scheduler and ensure they are consistent.
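To make the concern concrete, here is a minimal standalone sketch (hypothetical names, not the scheduler's actual code) of what happens if the budget is charged in tokens but restored in embeddings. Under the assumption that one embedding covers more than one input token, the budget leaks on every preemption:

```python
# Hypothetical illustration of the suspected unit mismatch: the budget is
# denominated in tokens (like max_num_encoder_input_tokens), charged in
# tokens, but restored in embeddings on preemption.

TOKENS_PER_EMBED = 4  # assumption: one encoder embedding covers 4 tokens


def schedule_then_preempt(budget_tokens: int, num_embeds: int) -> int:
    """Charge the budget in tokens, then restore it in embeddings."""
    num_tokens = num_embeds * TOKENS_PER_EMBED
    budget_tokens -= num_tokens   # charged in tokens when scheduled
    budget_tokens += num_embeds   # restored in embeddings when preempted
    return budget_tokens


budget = 1024  # stand-in for max_num_encoder_input_tokens
budget = schedule_then_preempt(budget, num_embeds=8)
# The preempted request's work was undone, so the budget should be back at
# 1024; instead 32 tokens were charged and only 8 restored, leaving 1000.
print(budget)
```

If embeddings and tokens are 1:1 the two bookkeeping schemes coincide and the change is harmless, which is exactly why the units should be verified rather than assumed.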

```diff
                 req_index -= 1
             else:
                 preempted_req = self.running.pop()
```