Merged
8 changes: 8 additions & 0 deletions — vllm/v1/sample/ops/penalties.py
@@ -21,6 +21,14 @@ def apply_all_penalties(
"""
_, vocab_size = logits.shape
output_tokens_t = _convert_to_tensors(output_token_ids, vocab_size, logits.device)

# In the async scheduling case, rows that won't have penalties applied may contain
# -1 placeholder token ids. We must replace these with valid token ids so that the
# scatter done in apply_penalties is valid.
# NOTE(nick): The penalties implementation is currently quite inefficient and
# will be reworked anyhow.
output_tokens_t.masked_fill_(output_tokens_t == -1, vocab_size)

Contributor @hidva commented on Dec 6, 2025:

Hi, I'd like to ask why the actual draft token isn't used here to replace the placeholder. Thanks.


Member (author) @njhill replied:


@hidva this line should only apply to requests/rows in the batch that don't require the output tokens (i.e. those that don't use penalty sampling parameters). The other rows should not contain any placeholder tokens at this point.

Also async scheduling + spec decode + penalties isn't yet supported (any help with that appreciated though - see discussion in #30122).


return apply_penalties(
logits,
prompt_token_ids,
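The masking trick in the diff can be sketched in isolation. The snippet below is a hypothetical, self-contained illustration (small made-up vocab and token tensors, not vLLM's actual `apply_penalties` internals): it assumes token ids are scattered into `vocab_size + 1` bins, so that `vocab_size` is a legal "overflow" index for the `-1` placeholders and its extra column can simply be dropped afterwards.

```python
import torch

vocab_size = 8

# Rows of output token ids; -1 marks async-scheduling placeholder slots
# (hypothetical values, mirroring the pattern fixed in the diff).
output_tokens = torch.tensor([[0, 3, 3, -1],
                              [5, 5, -1, -1]])

# A scatter with -1 indices is invalid, so replace placeholders with
# vocab_size, a valid index into an extra "overflow" column.
output_tokens = output_tokens.masked_fill(output_tokens == -1, vocab_size)

# Count token occurrences per row across vocab_size + 1 bins, then drop
# the overflow column so placeholders never affect the penalty counts.
bin_counts = torch.zeros(output_tokens.size(0), vocab_size + 1,
                         dtype=torch.long)
bin_counts.scatter_add_(1, output_tokens, torch.ones_like(output_tokens))
bin_counts = bin_counts[:, :vocab_size]
```

After this, `bin_counts` holds only the real tokens' counts (row 0: token 0 once, token 3 twice; row 1: token 5 twice), and the placeholder hits have been discarded with the overflow column.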