12 changes: 12 additions & 0 deletions vllm/v1/worker/gpu_model_runner.py
@@ -5138,6 +5138,18 @@ def _dummy_run(
        self.query_start_loc.copy_to_gpu()

        pad_attn = cudagraph_runtime_mode == CUDAGraphMode.FULL
        # populate buffers so that GDN attention triggers JIT compilation
        # of spec-decode before graph capture
        if (
            uniform_decode
            and self.speculative_config is not None
            and max_query_len > 1
        ):
            self.input_batch.num_accepted_tokens_cpu[:num_reqs] = max_query_len
            self.num_decode_draft_tokens.np[:num_reqs] = max_query_len - 1
            self.num_decode_draft_tokens.np[num_reqs:].fill(-1)
Comment on lines +5148 to +5150

Collaborator:
Can you explain the choice of the values in each of these cases? Is the .fill(-1) the correct convention?

@flutist (Contributor, Author), Mar 17, 2026:

- `num_accepted_tokens_cpu[:num_reqs] = max_query_len`
  Total tokens per request in spec decode = 1 original + (max_query_len - 1) draft. Setting it to max_query_len means "all accepted", a dummy value to trigger the spec-decode Triton JIT warmup.
- `num_decode_draft_tokens.np[:num_reqs] = max_query_len - 1`
  Draft count = total - 1. Lets GDN attention see `IS_SPEC_DECODING=True`.
- `self.num_decode_draft_tokens.np[num_reqs:].fill(-1)`
  -1 means "not a decode request". The consumer uses `>= 0` as a mask: `spec_sequence_masks_cpu = num_decode_draft_tokens_cpu >= 0` (True = decode, False = prefill/unused). 0 would be wrong, since it is indistinguishable from "decode with 0 drafts"; unused padding slots must be -1.
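The -1 sentinel convention described above can be sketched in a few lines of NumPy. The buffer name mirrors the one in the diff, but the batch sizes and the mask consumer shown here are illustrative, not the actual vLLM code path:

```python
import numpy as np

# Hypothetical padded batch: 8 slots, of which 3 are real decode requests.
num_reqs, max_query_len = 3, 4
num_decode_draft_tokens = np.empty(8, dtype=np.int32)

# Real decode slots get draft count = total - 1; padding slots get -1.
num_decode_draft_tokens[:num_reqs] = max_query_len - 1
num_decode_draft_tokens[num_reqs:].fill(-1)

# The consumer derives the spec-decode mask: >= 0 means "decode request".
spec_sequence_masks = num_decode_draft_tokens >= 0
print(spec_sequence_masks.tolist())
# A fill value of 0 would wrongly mark the padding slots as decode
# requests with zero drafts, so -1 is the only safe sentinel here.
```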

@flutist (Contributor, Author):

@benchislett Sorry to bother you amid your busy schedule. Could you please take a look and let me know if there's anything else I need to modify? If everything looks okay, could you approve this PR when you have a moment?

        self.num_decode_draft_tokens.copy_to_gpu()

        attn_metadata, _ = self._build_attention_metadata(
            num_tokens=num_tokens_unpadded,
            num_tokens_padded=num_tokens_padded if pad_attn else None,