optimize performance of ascend backend's update_step_context() by calculating kv_start_indices in a new way #2521
Motivation
Practical tests on the Ascend platform show that AscendOpsBackend.update_step_context() has room for performance optimization, especially in the prefill stage when the input sequence is long (for example, when it contains many image tokens), because the original code uses a slow loop to calculate kv_start_indices. The goal of this PR is to improve the performance of AscendOpsBackend.update_step_context() with more efficient code.
Modification
This PR changes only the lmdeploy/pytorch/backends/ascend/op_backend.py file. The main modification replaces the plain loop that calculates kv_start_indices with a calculation based on slot_tables and slot_indices (inspired by atb_models/examples/server in the Ascend mindie-atb_models package).
When running an 8B VL model on Ascend 910B with a request whose input length reaches 2000+ tokens (including image tokens), the time spent in AscendOpsBackend.update_step_context() during the prefill stage drops from 160ms+ to about 80ms, which helps when TTFT requirements are strict.
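The idea of replacing a per-token loop with a slot table lookup can be sketched as follows. This is a minimal illustration, not the code from this PR: NumPy stands in for torch tensors, the function names and argument layout are hypothetical, and it assumes block_offsets is a padded [batch, max_blocks] array of physical block ids. Only the names slot_tables, slot_indices, and kv_start_indices come from the description above.

```python
import numpy as np


def kv_start_indices_loop(block_offsets, kv_seqlens, q_seqlens, block_size):
    """Baseline sketch: visit every new token in a Python loop."""
    out = []
    for i in range(len(q_seqlens)):
        history = kv_seqlens[i] - q_seqlens[i]  # tokens already cached
        for pos in range(history, kv_seqlens[i]):
            block = block_offsets[i, pos // block_size]
            out.append(block * block_size + pos % block_size)
    return np.asarray(out, dtype=np.int64)


def kv_start_indices_vectorized(block_offsets, kv_seqlens, q_seqlens, block_size):
    """Sketch of the slot-table approach: build all slots once, then gather."""
    batch = len(q_seqlens)
    # slot_tables[i, p] is the physical cache slot of logical position p
    # in sequence i (padding blocks produce unused entries).
    slot_tables = (
        block_offsets[:, :, None] * block_size + np.arange(block_size)
    ).reshape(batch, -1)
    # slot_indices: one (row, col) pair per new token across the batch.
    rows = np.repeat(np.arange(batch), q_seqlens)
    q_starts = np.cumsum(q_seqlens) - q_seqlens  # offset of each sequence in the flat token list
    offset_in_q = np.arange(q_seqlens.sum()) - np.repeat(q_starts, q_seqlens)
    cols = np.repeat(kv_seqlens - q_seqlens, q_seqlens) + offset_in_q
    return slot_tables[rows, cols]


if __name__ == "__main__":
    block_size = 4
    block_offsets = np.array([[3, 1, 0], [2, 5, 0]])
    kv_seqlens = np.array([6, 5])
    q_seqlens = np.array([6, 2])  # first sequence is in prefill, second has 3 cached tokens
    print(kv_start_indices_loop(block_offsets, kv_seqlens, q_seqlens, block_size))
    print(kv_start_indices_vectorized(block_offsets, kv_seqlens, q_seqlens, block_size))
```

Both functions return the same indices; the second one does a fixed number of array operations regardless of how many tokens are in the batch, which is where a long prefill would be expected to benefit.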