Use FlashAttention for multi_query_kv_attention
#4
Merged
This PR uses FlashAttention kernels for multi_query_kv_attention, which performs masked attention over the prompt inputs.
Pros
Cons
Note also that FlashAttention does not support cached KV, which is required for interactive generation.
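For reference, the masked attention computed over a prompt can be sketched in plain Python as below. This is only an illustration for a single sequence and a single head, not the code in this PR and not the FlashAttention kernel. The function name causal_attention and the list-of-lists layout are hypothetical, and the sketch assumes that "masked" means the usual causal mask, where each prompt token attends only to itself and to earlier tokens.

```python
import math


def causal_attention(q, k, v):
    """Reference causal attention for one prompt and one head.

    q, k, v are lists of seq_len vectors (lists of floats).
    Hypothetical helper for illustration only.
    """
    head_dim = len(q[0])
    scale = 1.0 / math.sqrt(head_dim)
    out = []
    for i in range(len(q)):
        # Causal mask: token i only sees key/value positions 0..i.
        scores = [
            scale * sum(a * b for a, b in zip(q[i], k[j]))
            for j in range(i + 1)
        ]
        # Numerically stable softmax over the visible positions.
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        # Weighted average of the visible value vectors.
        out.append([
            sum(w * v[j][d] for j, w in enumerate(weights)) / total
            for d in range(len(v[0]))
        ])
    return out
```

With all-zero queries and keys, every visible position gets the same weight, so the output for token i is the mean of the value vectors at positions 0..i. That makes a convenient sanity check when comparing a fused kernel against a reference like this one.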
Tested models:
Tested GPUs: