[perf] Avoid dtype promotion sync in mamba_get_block_table_tensor #34870
vllm-bot merged 3 commits into vllm-project:main
Conversation
Signed-off-by: Huamin Li <3ericli@gmail.com>
Code Review
This pull request introduces a performance optimization within the mamba_get_block_table_tensor function. The change modifies the creation of the offsets tensor to use torch.int32, which avoids a dtype promotion to int64 during the subsequent addition. The final result is then explicitly cast to torch.int64 to satisfy torch.gather, which requires int64 indices. This is a sound optimization that should reduce overhead as intended. The implementation appears correct and I found no issues.
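A minimal sketch of the pattern described above (function and variable names are illustrative, not the exact vLLM code): keep offsets in int32 so the addition with the int32 start_indices does not promote, and cast once to int64 only for torch.gather.

```python
import torch

def gather_blocks(block_table: torch.Tensor,
                  start_indices: torch.Tensor,
                  num_blocks: int) -> torch.Tensor:
    # int32 offsets: adding them to int32 start_indices stays in int32,
    # avoiding the dtype-promotion cast the profile flagged.
    offsets = torch.arange(num_blocks, dtype=torch.int32,
                           device=start_indices.device)
    indices_to_gather = start_indices.unsqueeze(1) + offsets  # int32
    # torch.gather requires int64 indices, so cast explicitly at the end.
    return block_table.gather(1, indices_to_gather.to(torch.int64))
```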
heheda12345
left a comment
Nice finding! Thanks for your contribution.
CI failures are probably from #33600
…lm-project#34870) Signed-off-by: Huamin Li <3ericli@gmail.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Purpose
From perf profiling on a model using linear attention, one thing that draws attention is an aten::to call coming from somewhere in vLLM's gpu_model_runner preprocessing. For this aten::to, we can tell it is a cast with dtype=3 (int64).

With CC's help, we found it comes from mamba_get_block_table_tensor: start_indices is int32 while offsets is int64, so indices_to_gather = start_indices.unsqueeze(1) + offsets performs an int32 + int64 addition, which triggers the promotion and a sync. We now create offsets as int32 as well, so no sync occurs. With the change, the new profiling looks much better.
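A minimal, self-contained illustration of the promotion behavior described above (not the vLLM code itself): adding an int32 tensor to an int64 tensor promotes the result to int64, while keeping both operands int32 leaves the result in int32.

```python
import torch

start = torch.zeros(4, dtype=torch.int32)          # like start_indices
offsets_64 = torch.arange(8, dtype=torch.int64)    # old: int64 offsets
offsets_32 = torch.arange(8, dtype=torch.int32)    # new: int32 offsets

promoted = start.unsqueeze(1) + offsets_64    # int32 + int64 -> int64
unpromoted = start.unsqueeze(1) + offsets_32  # int32 + int32 -> int32

print(promoted.dtype, unpromoted.dtype)  # torch.int64 torch.int32
```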
Test Plan
Test Result