Enable slicing for the FusedSDPA #1155
Merged
✅ CI Passed — All checks passed successfully against the following vllm commit:
Upstreaming the slicing for fp8 FusedSDPA.
@afierka-intel cc: @czhu15
Signed-off-by: Youlei Yang <youlei.yang@intel.com>
FusedSDPA can be split into smaller chunks to improve performance when using the padding-aware bucketing strategy, which guarantees the maximum absolute padding in the sequence and context dimensions.
Usage

| Environment variable | Default / notes |
| --- | --- |
| `VLLM_HPU_FSDPA_SLICE_ENABLED` | `True` for the padding-aware bucketing strategy |
| `VLLM_HPU_FSDPA_SLICE_SEQ_LEN_THLD` | `min(max_num_batched_tokens, 8192)` |
| `VLLM_HPU_FSDPA_SLICE_CHUNK_SIZE` | `q_len` and `kv_len` in each chunk; rounded up to the next multiple of 1024. Defaults to `VLLM_HPU_FSDPA_SLICE_SEQ_LEN_THLD // 2` |
| `VLLM_HPU_FSDPA_SLICE_WITH_GRAPH_BREAKS` | `true` for lazy mode and `false` otherwise |

Important: these parameters are effective only with the padding-aware bucketing strategy set by `VLLM_BUCKETING_STRATEGY="pad"`.

Implementation
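To make the defaults above concrete, here is a minimal sketch of how the effective configuration could be resolved. The helper names (`_env_flag`, `resolve_slice_config`) are hypothetical and the parsing is an illustration of the defaults listed in the table, not the actual upstream code:

```python
import os

def _env_flag(name: str, default: bool) -> bool:
    """Hypothetical helper: parse a boolean environment variable."""
    val = os.environ.get(name)
    if val is None:
        return default
    return val.lower() in ("1", "true", "yes")

def resolve_slice_config(max_num_batched_tokens: int, lazy_mode: bool) -> dict:
    # Defaults as described in this PR; the exact parsing is a sketch.
    seq_len_thld = int(os.environ.get(
        "VLLM_HPU_FSDPA_SLICE_SEQ_LEN_THLD",
        min(max_num_batched_tokens, 8192)))
    chunk_size = int(os.environ.get(
        "VLLM_HPU_FSDPA_SLICE_CHUNK_SIZE", seq_len_thld // 2))
    # The chunk size is rounded up to the next multiple of 1024.
    chunk_size = (chunk_size + 1023) // 1024 * 1024
    return {
        "enabled": _env_flag("VLLM_HPU_FSDPA_SLICE_ENABLED", True),
        "seq_len_thld": seq_len_thld,
        "chunk_size": chunk_size,
        # Graph breaks default to on in lazy mode, off otherwise.
        "graph_breaks": _env_flag(
            "VLLM_HPU_FSDPA_SLICE_WITH_GRAPH_BREAKS", lazy_mode),
    }
```

For example, with `max_num_batched_tokens=4096` and no environment overrides, the threshold resolves to 4096 and the chunk size to 2048.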
Take a prefix-prefill with `[bs, query, context] = [1, 9037, 8832]` as an example. Bucketing first pads the prefill shape to `[1, 10880, 11008]`, and the attention mask covers this padded shape:

[figure: attention mask over the padded shape, omitted]

The original implementation passes the full attention mask to the FusedSDPA kernel.
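As a quick sanity check on the example shapes, the absolute padding added by bucketing is pure arithmetic on the numbers above:

```python
# Padding introduced by bucketing in the example above.
query, context = 9037, 8832
padded_query, padded_context = 10880, 11008

query_pad = padded_query - query        # padded query tokens
context_pad = padded_context - context  # padded context tokens
```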
This PR introduces an implementation that computes the FSDPA in chunks by slicing `Q`, `K`, and `V` as below:

[figure: chunked Q/K/V slicing, omitted]

The color of each rectangle indicates the `is_causal` and `attn_mask` parameters passed to the FusedSDPA kernel:

- `rgb(255,0,0)` (red): `is_causal=False` and `attn_mask is not None`
- `rgb(255,255,0)` (yellow): `is_causal=True` and `attn_mask=None`
- `rgb(255,0,255)` (magenta): `is_causal=False` and `attn_mask=None`

This way, most chunks call FusedSDPA without an attention mask for better performance, and the graph compiled for a chunk can be reused across different buckets to reduce the warmup duration.
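To illustrate the slicing idea (not the HPU kernel code), the sketch below chunks the query dimension of a prefix-prefill and uses PyTorch's `scaled_dot_product_attention` as a stand-in for FusedSDPA. For clarity it folds the per-tile `is_causal`/`attn_mask` distinction into one explicit boolean mask per chunk; the function name and signature are hypothetical:

```python
import torch
import torch.nn.functional as F

def chunked_prefill_attention(q, k, v, ctx_len, chunk_size):
    """Sketch: prefix-prefill attention computed per query chunk.

    q: [B, H, q_len, D]; k, v: [B, H, ctx_len + q_len, D].
    Each query chunk only sees KV positions up to its own end, so the
    KV tensors can be sliced per chunk as in this PR.
    """
    q_len = q.shape[2]
    outs = []
    for s in range(0, q_len, chunk_size):
        cq = min(chunk_size, q_len - s)
        vis = ctx_len + s + cq  # KV positions visible to this chunk
        # In the PR, the tile over kv[: ctx_len + s] is fully visible
        # (is_causal=False, attn_mask=None) and the diagonal tile runs
        # with is_causal=True; here both are expressed as one mask.
        i = torch.arange(cq).unsqueeze(1)   # query offset within chunk
        j = torch.arange(vis).unsqueeze(0)  # absolute KV position
        mask = j <= (ctx_len + s + i)       # causal w.r.t. the context
        outs.append(F.scaled_dot_product_attention(
            q[:, :, s:s + cq], k[:, :, :vis], v[:, :, :vis],
            attn_mask=mask))
    return torch.cat(outs, dim=2)
```

Under these assumptions the chunked result matches a single full-mask call, which is the correctness property the tiling relies on.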
Dependencies

- `PAD_MAX` for the query and context.

Thanks to @Wei-Lin-Intel for the original idea and for detailing the behavior of the FusedSDPA kernel.