Fixes from PR#1635 applied to v1.22.0_next branch #1647
542611e to 8e07f7f
The base for this PR is not a vllm-fork branch. Is it needed? If so, please fix the pre-commit issues.
FusedSDPA kernel with window_size+causal only works when seq_len is a multiple of SLICE_SIZE. If not, fall back to the original implementation, which creates attention_mask with window_size
- Move the seq_len check for use_sdpa_window to attn_metadata
- Automatically set all environment variables
Additional changes from PR1597
remove print statement
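The fallback rule from the first commit message above can be illustrated with a minimal sketch. This is not the code from the PR: the function name `can_use_sdpa_window` and its signature are hypothetical, and the `SLICE_SIZE` value used in the usage lines is an assumed placeholder. Only the rule itself (window_size plus causal requires seq_len to be a multiple of SLICE_SIZE) comes from the commit message.

```python
from typing import Optional


def can_use_sdpa_window(seq_len: int,
                        slice_size: int,
                        window_size: Optional[int],
                        is_causal: bool) -> bool:
    """Hypothetical helper: decide whether the windowed FusedSDPA path applies.

    Returns False when the caller should fall back to building an
    attention mask with window_size instead.
    """
    # The windowed kernel path is only relevant for causal sliding-window attention.
    if window_size is None or not is_causal:
        return False
    # Per the commit message, seq_len must be a multiple of SLICE_SIZE.
    return seq_len % slice_size == 0


# Usage with an assumed SLICE_SIZE of 1024 (placeholder value).
ASSUMED_SLICE_SIZE = 1024
aligned = can_use_sdpa_window(2048, ASSUMED_SLICE_SIZE, window_size=1024, is_causal=True)
unaligned = can_use_sdpa_window(2000, ASSUMED_SLICE_SIZE, window_size=1024, is_causal=True)
```

With these inputs, `aligned` is true and `unaligned` is false, so only the second case would take the mask-based fallback.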
e01eae7 to 518dab2
This branch has now been rebased on the latest vllm-fork v1.22.0_next after yesterday's merge of PR1616.
@michalkuligowski, this is ready to merge to v1.22.0_next.
/run-gaudi-tests
setuptools>=77.0.3
setuptools-scm>=8
vllm-hpu-extension @ git+https://github.com/HabanaAI/vllm-hpu-extension.git@5135570
vllm-hpu-extension @ git+https://github.com/HabanaAI/vllm-hpu-extension.git@009adb2
This PR is almost identical to #1660. Please apply the comments from there.
Closing this; it is done in #1660.
This PR contains the following changes:
1. Port the Gemma3 SLIDING_WINDOW FusedSDPA feature from habana_main, plus a few extra fixes:
- For the sliding FusedSDPA kernel, add a threshold variable that enables or disables the optimized kernel. The kernel gives a performance and memory benefit for longer sequences. An environment variable is provided to control it, per customer request.
- Choose the prompt bucket based on the threshold: if it is smaller than the threshold, use PROMPT_BUCKET_STEP, otherwise use SLICE_SIZE.
- Add mark_step before the sliding FusedSDPA is run.
- Misc fixes for bucket-related issues.
2. Upstream fixes: vllm-project#18732, vllm-project#21479, vllm-project#19788
3. Optimize Gemma3RMSNorm with FusedRMSNorm.

Dependent on #1647.

Run command with:
VLLM_FUSEDSDPA_SLIDE_THLD=2048 VLLM_EXPONENTIAL_BUCKETING=false VLLM_PROMPT_BS_BUCKET_MAX=64 VLLM_PROMPT_SEQ_BUCKET_STEP=1024 VLLM_PROMPT_SEQ_BUCKET_MAX=20480 PT_HPU_SDPA_QKV_SLICE_MODE_FWD=1

---------

Signed-off-by: Lukas Geiger <[email protected]>
Signed-off-by: Hongmin Fan <[email protected]>
Co-authored-by: Henry Tang <[email protected]>
Co-authored-by: Mohit Deopujari <[email protected]>
Co-authored-by: Shiv Kaul <[email protected]>
Co-authored-by: Shiv Kaul <[email protected]>
Co-authored-by: Libin Tang <[email protected]>
Co-authored-by: Lukas Geiger <[email protected]>
Co-authored-by: Hongmin Fan <[email protected]>
Co-authored-by: Harish Subramony <[email protected]>
Co-authored-by: Jianhong-Zhang <[email protected]>
Co-authored-by: Libin Tang <[email protected]>
Co-authored-by: Michał Kuligowski <[email protected]>
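The threshold-based bucket choice described in item 1 could look roughly like the sketch below. It is an illustration under stated assumptions, not the implementation from #1660: the helper names are hypothetical, the quantity compared against the threshold is assumed to be the prompt sequence length, the numeric values in the usage lines are placeholders, and padding by rounding up to the next multiple of the step is an assumption about how linear bucketing behaves.

```python
def prompt_bucket_step(seq_len: int, threshold: int,
                       bucket_step: int, slice_size: int) -> int:
    """Hypothetical helper: pick the bucketing step for a prompt.

    Below the threshold the regular prompt bucket step is used;
    at or above it, the step is SLICE_SIZE so that the padded length
    stays a multiple of SLICE_SIZE.
    """
    return bucket_step if seq_len < threshold else slice_size


def padded_prompt_len(seq_len: int, threshold: int,
                      bucket_step: int, slice_size: int) -> int:
    """Round seq_len up to the next multiple of the chosen step (assumed behavior)."""
    step = prompt_bucket_step(seq_len, threshold, bucket_step, slice_size)
    return -(-seq_len // step) * step  # ceiling division, then scale back up


# Placeholder values: threshold as in VLLM_FUSEDSDPA_SLIDE_THLD=2048,
# an assumed bucket step of 128 and an assumed SLICE_SIZE of 1024.
short_len = padded_prompt_len(300, threshold=2048, bucket_step=128, slice_size=1024)
long_len = padded_prompt_len(3000, threshold=2048, bucket_step=128, slice_size=1024)
```

Under these placeholder values a 300-token prompt pads to 384 and a 3000-token prompt pads to 3072, so only the longer prompt ends up aligned to the assumed SLICE_SIZE.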