Conversation

@WoosukKwon WoosukKwon commented Mar 2, 2023

This PR uses FlashAttention kernels for multi_query_kv_attention, which performs masked attention over the prompt inputs.

Pros

  • FlashAttention is fast and memory-efficient.
  • FlashAttention supports 1D inputs and only invokes a single kernel to handle multiple sequences with variable lengths.
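The 1D-input point above can be sketched in a few lines: multiple variable-length sequences are concatenated into one flat buffer and addressed through cumulative sequence lengths, so a single kernel launch can cover all of them. This is a minimal, framework-free illustration; the names are illustrative, not vLLM's actual code.

```python
# Pack variable-length sequences into one flat buffer and build the
# cumulative-length offsets (often called cu_seqlens) that a varlen
# attention kernel consumes to find each sequence's boundaries.

def pack_sequences(seqs):
    """Concatenate sequences; return (packed, cu_seqlens).

    cu_seqlens[i] is the start offset of sequence i in the packed
    buffer; cu_seqlens[-1] is the total token count.
    """
    packed = []
    cu_seqlens = [0]
    for s in seqs:
        packed.extend(s)
        cu_seqlens.append(cu_seqlens[-1] + len(s))
    return packed, cu_seqlens

def unpack(packed, cu_seqlens, i):
    """Recover sequence i from the packed buffer."""
    return packed[cu_seqlens[i]:cu_seqlens[i + 1]]

seqs = [[1, 2, 3], [4], [5, 6]]
packed, cu = pack_sequences(seqs)
# packed == [1, 2, 3, 4, 5, 6]; cu == [0, 3, 4, 6]
```

With this layout there is no per-sequence padding, which is where the memory efficiency for batches of mixed-length prompts comes from.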

Cons

  • FlashAttention does not support cached KV, which is required for incremental decoding during interactive generation.
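To illustrate the decode-side constraint: each newly generated token must attend over the keys/values of all previous tokens, which are kept in a cache rather than recomputed. Below is a toy, scalar-dimension sketch of that pattern (illustrative only, not vLLM's implementation).

```python
import math

def attend(q, ks, vs):
    """Single-query softmax attention over cached keys/values,
    with scalar 'embeddings' for readability."""
    scores = [q * k for k in ks]
    m = max(scores)                      # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return sum(w / z * v for w, v in zip(exps, vs))

k_cache, v_cache = [], []
for k, v, q in [(1.0, 2.0, 0.5), (0.5, 1.0, 1.0)]:
    k_cache.append(k)   # grow the cache by this step's key/value
    v_cache.append(v)
    out = attend(q, k_cache, v_cache)
```

A prompt-only kernel computes attention over a full, uncached sequence in one shot, so it cannot serve this step-by-step loop; that is why a separate path is still needed for generation.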

Tested models:

  • OPT-125M
  • OPT-350M
  • OPT-1.3B
  • OPT-2.7B
  • OPT-6.7B
  • OPT-13B

Tested GPUs:

  • A100

@WoosukKwon WoosukKwon merged commit 3e9f991 into main Mar 2, 2023
@WoosukKwon WoosukKwon deleted the flash-attn branch March 2, 2023 05:13
xiangyuT added a commit to xiangyuT/vllm that referenced this pull request Oct 24, 2023
hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024
luo-cheng2021 pushed a commit to luo-cheng2021/vllm that referenced this pull request Mar 12, 2024
luo-cheng2021 pushed a commit to luo-cheng2021/vllm that referenced this pull request Mar 25, 2024
…o-model-executor

Adapt OpenVINO CPU plugin implementation
mzusman pushed a commit to mzusman/vllm that referenced this pull request Apr 16, 2024
BA-78760: Jamba

* Add support for n concat and splitting

* change naming

* input_metadata is now a list of dicts in order to pass "n"

* clean up code from unnecessary changes and prints

* Remove kv cache allocation in case of mamba layer

* Add the considerations of mamba layer cache into the num of blocks
calculation

* Delete mamba cache after profile

* Remove prints

* Cleaning

* - and not _ for requirements

Approved-by: Tomer Asida
linxihui added a commit to linxihui/vllm that referenced this pull request May 14, 2024
yukavio pushed a commit to yukavio/vllm that referenced this pull request Jul 3, 2024
…ect#4

magic_wand semi_structured_sparse_tensor_linear branch integrates 2:4 semi-structured sparsity into SparseTensor. This PR adds a new sparsity config for 2:4 sparsity to neuralmagic-vllm, using the SparseTensor 2:4 support.

This PR also refactors the sparse linear method into a separate file, vllm/model_executor/layers/sparsity/sparse_w16a16_linear_method.py, which supports all sparsity formats.
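For context, 2:4 semi-structured sparsity means every aligned group of four weights contains at most two nonzeros. A minimal, library-free check of that pattern (illustrative only; the actual integration lives in magic_wand's SparseTensor):

```python
def is_24_sparse(row, group=4, max_nonzero=2):
    """True if every aligned group of `group` values has at most
    `max_nonzero` nonzeros (the 2:4 semi-structured pattern)."""
    assert len(row) % group == 0, "row length must be a multiple of group"
    return all(
        sum(1 for v in row[i:i + group] if v != 0) <= max_nonzero
        for i in range(0, len(row), group)
    )

assert is_24_sparse([0.5, 0.0, -1.2, 0.0,  0.0, 0.0, 3.0, 1.0])
assert not is_24_sparse([1.0, 1.0, 1.0, 0.0])
```

The fixed 2-of-4 structure is what lets hardware (e.g. Ampere sparse tensor cores) skip the zeroed positions at a predictable stride, unlike unstructured sparsity.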
yukavio pushed a commit to yukavio/vllm that referenced this pull request Jul 3, 2024
zhaoyifan222 referenced this pull request in LookAround0301/vllm Sep 11, 2025
yuz207 referenced this pull request in IluvatarLabs/vllm Sep 30, 2025
Add diagnostic logging to verify draft_top_p value and whether nucleus
will execute.

This will help diagnose why nucleus shows 32000 survivors (the full
vocab) instead of a filtered set.

Expected log output:
[NUCLEUS_DEBUG] draft_top_p from config: 0.95, will run nucleus: True

If we see 'will run nucleus: False', we'll know the config isn't loaded
or there's a logic bug in the condition.
yuz207 referenced this pull request in IluvatarLabs/vllm Sep 30, 2025
Bug #4 fix: Change nucleus top_p fallback from 1.0 to 0.95, add
[NUCLEUS_DEBUG] diagnostic logging. This ensures nucleus runs even if
config attribute is missing, preventing 32000 survivors (full vocab).

Bug #5 fix: Add [SMOOTH_DEBUG] diagnostic logging for smoothing lambda.

These fixes were accidentally removed during the bug #2 draft-anchored
rewrite (commit 595a371). Restoring them does not affect bug #2's
core algorithm - they only improve fallback behavior and diagnostics.
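For context on the fallback fix above, here is a toy sketch of nucleus (top-p) filtering showing why a fallback of 1.0 effectively disables it, while 0.95 keeps only the high-probability head (names and values are illustrative, not this repo's code):

```python
def nucleus_filter(probs, top_p):
    """Keep the smallest set of token ids whose cumulative probability
    reaches top_p; with top_p == 1.0 the full vocabulary survives."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    return set(kept)

probs = [0.4, 0.3, 0.2, 0.1]
nucleus_filter(probs, 0.85)  # keeps the head of the distribution
nucleus_filter(probs, 1.0)   # keeps every token: filtering is a no-op
```

This mirrors the symptom described: a top_p fallback of 1.0 lets the whole vocabulary survive, while a fallback below 1.0 guarantees real filtering even when the config attribute is missing.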
dcmaddix referenced this pull request in dcmaddix/vllm Oct 6, 2025
Update expert shapes after rebase
BloodAxe pushed a commit to BloodAxe/vllm that referenced this pull request Oct 17, 2025
…llm-project#4)

* created a new class for video sampling, revert original glm behavior

* ruff

* ruff
IwakuraRein pushed a commit to IwakuraRein/vllm that referenced this pull request Oct 21, 2025
markmc pushed a commit to markmc/vllm that referenced this pull request Oct 24, 2025
…tion

[NIXL][Metrics] Add abstraction for per-connector Prometheus metrics
Bounty-hunter pushed a commit to Bounty-hunter/vllm that referenced this pull request Nov 4, 2025
* # This is a combination of 6 commits, each with the message:

mooncake store connector

Signed-off-by: CHEN <[email protected]>

* mooncake store connector (the same message and sign-off repeated across the remaining squashed commits)

fix comments

* Update vllm/distributed/ec_transfer/utils/tensor_memory_pool.py

Co-authored-by: Copilot <[email protected]>

* Update vllm/distributed/ec_transfer/ec_lookup_buffer/mooncake_store.py

Co-authored-by: Copilot <[email protected]>

* Update vllm/distributed/ec_transfer/ec_connector/mooncake_storage_connector.py

Co-authored-by: Copilot <[email protected]>

* Apply suggestion from @wuhang2014

line length format

* Apply suggestion from @wuhang2014

remove extra empty line

---------

Signed-off-by: CHEN <[email protected]>
Co-authored-by: wuhang <[email protected]>
Co-authored-by: Copilot <[email protected]>
access2rohit pushed a commit to access2rohit/vllm that referenced this pull request Nov 11, 2025
…lin_experts_mxfp4

"enable early exit for fused_moe_lora"