[BugFix] Mistakenly passing `num_reqs_padded` as `num_reqs` in `_dummy_run` (#34121)
Selkh wants to merge 0 commits into vllm-project:main
Conversation
Code Review
This pull request fixes a bug in _dummy_run where num_reqs_padded was incorrectly passed as the num_reqs argument to _build_attention_metadata. While this change is correct, the call is still incomplete as it's missing the num_tokens_padded and num_reqs_padded arguments, which can lead to incorrect behavior when CUDA graph padding is enabled. I've suggested a more complete fix to ensure that padded values are used correctly when pad_attn is true.
This pull request has merge conflicts that must be resolved before it can be merged.
@LucasWilkinson Still not fixed after #34187 when SP is enabled and the attention backend requires a uniform batch.
LucasWilkinson left a comment
Thanks for the contribution! Let's just make this fully match `execute_model`, i.e.
num_reqs=num_reqs,
num_reqs_padded=num_reqs_padded if pad_attn else None,
"dummy_run" may need unpadded num_reqs and num_tokens, rather than matching execute_model. Consider the following two scenarios:
Can you please provide a reproducer?

Maybe similar to #35243
Purpose
In "_dummy_run", "num_tokens_padded" was mistakenly passed as num_tokens, leading to an Assertion Error when "split_decodes_and_prefills" for an attention backend "requires uniform" when "enable_sp" is ON.
Test Plan
Test Result