…m-project#6874)
- vLLM version: v0.16.0
- vLLM main: vllm-project/vllm@15d76f7
Signed-off-by: rjg-lyh <1318825571@qq.com>
What this PR does / why we need it?
This PR refactors sfa_v1.py to improve code readability and usability, fixes a code bug, and enhances performance through the replacement of certain operators.
changes
- improve code readability: Restructures parts of `sfa_v1.py`, adds comments to key code blocks, removes unused variables, and gives some functions and variables clearer names.
- resolve a duplicated write to k_cache: Fixes redundant double writes of `k_cache` in the indexer_select module (in both the `forward` function and `indexer_select_post_process`). In the comparison below, TTFT drops from 28 s to 24 s with this fix.
- replace `scatter` ops with `reshape_and_cache`: Replaces two separate cache-store operations on `k_nope` and `k_pe` with a single call to the `reshape_and_cache` operator. For generality, the original `scatter` operator reorders slot_mapping, which introduces significant scalar computation. The `reshape_and_cache` operator skips this reordering step, which reduces computation time.
performance comparison
Setup: 4*A3, 1P1D, P dp2tp16, D dp8tp4, input/output: 64K/3K
origin:
TTFT: 28 s, TPOT: 26 ms, TPS: 820 tokens/s
fixed redundant double writes of k_cache:
TTFT: 24 s, TPOT: 26 ms, TPS: 840 tokens/s
replace scatter ops with reshape_and_cache:
TTFT: 24 s, TPOT: 26 ms, TPS: 850 tokens/s
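The redundant k_cache write described under changes can be pictured with a toy cache. This is only a sketch: `CountingCache`, `run_with_double_write`, and `run_with_single_write` are hypothetical names, and it assumes the redundancy amounts to the same keys being stored twice at the same slots, which the PR text does not spell out.

```python
class CountingCache:
    """Toy slot-indexed cache that counts how many slot writes it receives."""

    def __init__(self, num_slots):
        self.slots = [None] * num_slots
        self.writes = 0

    def write(self, slot_mapping, keys):
        # Store each key at the slot assigned to its token.
        for slot, key in zip(slot_mapping, keys):
            self.slots[slot] = key
            self.writes += 1


def run_with_double_write(cache, slot_mapping, keys):
    # Assumed "before" behaviour: two stages both store the same keys.
    cache.write(slot_mapping, keys)
    cache.write(slot_mapping, keys)


def run_with_single_write(cache, slot_mapping, keys):
    # Assumed "after" behaviour: the keys are stored exactly once.
    cache.write(slot_mapping, keys)


slot_mapping = [5, 2, 7]
keys = ["k0", "k1", "k2"]

before = CountingCache(num_slots=8)
after = CountingCache(num_slots=8)
run_with_double_write(before, slot_mapping, keys)
run_with_single_write(after, slot_mapping, keys)

print(before.slots == after.slots)  # True: the second write changed nothing
print(before.writes, after.writes)  # 6 3
```

Because the second write stores identical data, dropping it leaves the cache contents unchanged and only removes work.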
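The idea of replacing two cache stores with one call can be sketched in plain Python. This does not model the real `scatter` or `reshape_and_cache` operators, whose signatures are not shown in this PR, nor the slot_mapping reordering. The helpers `make_cache`, `write_separately`, and `write_fused`, and the paged layout (slot = block * block_size + offset), are assumptions for illustration.

```python
def make_cache(num_blocks, block_size, dim):
    """Allocate a zero-filled paged cache: [block][offset][feature]."""
    return [[[0.0] * dim for _ in range(block_size)] for _ in range(num_blocks)]


def write_separately(nope_cache, pe_cache, k_nope, k_pe, slot_mapping, block_size):
    """Two passes over slot_mapping, one per cache (stand-in for two store ops)."""
    for token, slot in enumerate(slot_mapping):
        block, offset = divmod(slot, block_size)
        nope_cache[block][offset] = list(k_nope[token])
    for token, slot in enumerate(slot_mapping):
        block, offset = divmod(slot, block_size)
        pe_cache[block][offset] = list(k_pe[token])


def write_fused(nope_cache, pe_cache, k_nope, k_pe, slot_mapping, block_size):
    """One pass: each slot is decoded once and both caches are updated together."""
    for token, slot in enumerate(slot_mapping):
        block, offset = divmod(slot, block_size)
        nope_cache[block][offset] = list(k_nope[token])
        pe_cache[block][offset] = list(k_pe[token])


block_size, num_blocks = 4, 2
k_nope = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]  # 3 tokens, dim 3
k_pe = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]                    # 3 tokens, dim 2
slot_mapping = [6, 1, 4]                                       # flat slot per token

nope_a, pe_a = make_cache(num_blocks, block_size, 3), make_cache(num_blocks, block_size, 2)
nope_b, pe_b = make_cache(num_blocks, block_size, 3), make_cache(num_blocks, block_size, 2)
write_separately(nope_a, pe_a, k_nope, k_pe, slot_mapping, block_size)
write_fused(nope_b, pe_b, k_nope, k_pe, slot_mapping, block_size)

print(nope_a == nope_b and pe_a == pe_b)  # True: both paths fill the caches identically
```

In this toy version the fused path decodes each slot index once instead of twice. The PR attributes the real gain to `reshape_and_cache` avoiding the slot_mapping reordering that `scatter` performs.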
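To put the three measurements on a common scale, the relative changes against the origin run can be computed directly. The numbers are the ones reported above; the dictionary keys and the `pct_change` helper are illustrative names only.

```python
# Reported results: TTFT in seconds, TPOT in milliseconds, TPS in tokens/s.
runs = {
    "origin": {"ttft_s": 28, "tpot_ms": 26, "tps": 820},
    "single k_cache write": {"ttft_s": 24, "tpot_ms": 26, "tps": 840},
    "reshape_and_cache": {"ttft_s": 24, "tpot_ms": 26, "tps": 850},
}


def pct_change(new, old):
    """Relative change of `new` with respect to `old`, in percent."""
    return (new - old) / old * 100.0


baseline = runs["origin"]
for name, result in runs.items():
    ttft = pct_change(result["ttft_s"], baseline["ttft_s"])
    tps = pct_change(result["tps"], baseline["tps"])
    print(f"{name}: TTFT {ttft:+.1f}%, TPS {tps:+.1f}%")
```

This gives a TTFT reduction of about 14.3% for both optimized runs, and TPS gains of about 2.4% and 3.7% over origin. TPOT stays at 26 ms in all three runs.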
Does this PR introduce any user-facing change? No.
How was this patch tested?
CI passed with newly added and existing tests.