Use FlashAttention for `multi_query_kv_attention` #4

WoosukKwon · 2023-03-02T05:05:56Z

This PR uses FlashAttention kernels for `multi_query_kv_attention`, which performs masked attention over the prompt inputs.
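For readers unfamiliar with what the kernel replaces, here is a rough pure-Python reference for masked attention on a single head. It assumes the mask is the usual causal (lower-triangular) one used by decoder-only models such as OPT, which the description does not spell out. The function name `causal_attention` is hypothetical and this is not the code in the PR.

```python
import math


def causal_attention(q, k, v):
    """Reference causal attention for one head.

    q, k, v are lists of n vectors (lists of floats); q and k share a dimension d.
    Hypothetical helper for illustration only, not part of the PR.
    """
    n, d = len(q), len(q[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for i in range(n):
        # Causal mask: token i attends only to positions 0..i.
        scores = [scale * sum(a * b for a, b in zip(q[i], k[j]))
                  for j in range(i + 1)]
        # Numerically stable softmax over the visible positions.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v[j][c] for j, w in enumerate(weights)) / total
            for c in range(len(v[0]))
        ])
    return out
```

A fused kernel computes the same result without materializing the full score matrix, which is where the speed and memory savings come from.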

Pros

- FlashAttention is fast and memory-efficient.
- FlashAttention supports 1D inputs and invokes only a single kernel to handle multiple sequences with variable lengths.
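The "1D inputs" point can be illustrated with a small sketch of how variable-length sequences are typically packed: all tokens are concatenated into one flat array, and a list of cumulative offsets marks where each sequence starts and ends. The helper names `pack_sequences` and `unpack_sequences` are hypothetical, and the `cu_seqlens` name is borrowed from the convention used by variable-length attention interfaces. The exact layout used in this PR is an assumption.

```python
from itertools import accumulate


def pack_sequences(seqs):
    """Flatten variable-length sequences and record their boundaries.

    Returns (packed, cu_seqlens, max_seqlen), where sequence i occupies
    packed[cu_seqlens[i]:cu_seqlens[i + 1]]. Hypothetical helper.
    """
    lengths = [len(s) for s in seqs]
    cu_seqlens = [0] + list(accumulate(lengths))
    packed = [tok for s in seqs for tok in s]
    max_seqlen = max(lengths, default=0)
    return packed, cu_seqlens, max_seqlen


def unpack_sequences(packed, cu_seqlens):
    """Inverse of pack_sequences: slice the flat list back into sequences."""
    return [packed[a:b] for a, b in zip(cu_seqlens, cu_seqlens[1:])]
```

Because the boundaries travel with the data, one kernel launch can process the whole batch without padding every sequence to the longest length.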

Cons

- FlashAttention does NOT support FP32.
- FlashAttention does not support `head_size` > 128. (This is fine for all models except GPT-J.)
  - Ref: Dao-AILab/flash-attention#67 (Support for 256 head dim)
- FlashAttention does not support attention bias (GPT-J, BLOOM, LLaMA).
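The three limitations above could be turned into a pre-flight check along these lines. This is only a sketch: the function name, the string dtype labels, and the `uses_attention_bias` flag are hypothetical, and the limits are taken from the list above rather than from the PR's code.

```python
# Limit taken from the Cons list above; not read from any library.
MAX_HEAD_SIZE = 128


def flash_attention_unsupported_reasons(dtype, head_size, uses_attention_bias):
    """Return a list of reasons the configuration cannot use FlashAttention.

    An empty list means none of the listed limitations apply.
    Hypothetical helper for illustration only.
    """
    reasons = []
    if dtype == "float32":
        reasons.append("FP32 is not supported")
    if head_size > MAX_HEAD_SIZE:
        reasons.append(f"head_size {head_size} exceeds {MAX_HEAD_SIZE}")
    if uses_attention_bias:
        reasons.append("attention bias is not supported")
    return reasons
```

Returning every reason at once, instead of raising on the first, makes it easier to report all blockers for a model in a single error message.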

Also note that FlashAttention does not support cached KV, which is required for interactive generation.

Tested models:

OPT-125M
OPT-350M
OPT-1.3B
OPT-2.7B
OPT-6.7B
OPT-13B

Tested GPUs:

A100

WoosukKwon added 8 commits March 2, 2023 04:20

Add a FlashAttention test

5685bac

Define MAX_SEQ_LEN

320b20c

Minor

4932c71

Use FlashAttention for multi_query_kv_attention

c302754

Add more head sizes for test

95e5c0f

Add error msgs

1c28f4f

Enhance the server script

eb9e9a0

Add flash-attn to README

72052b7

WoosukKwon merged commit 3e9f991 into main Mar 2, 2023

WoosukKwon deleted the flash-attn branch March 2, 2023 05:13

TheBloke mentioned this pull request Jul 20, 2023

Can't launch OpenAI API server on newly installed vLLM in Docker - fastchat not found #537

Closed

tmm1 mentioned this pull request Aug 3, 2023

Fix the rushed out multi-query kernel #44

Closed

xiangyuT added a commit to xiangyuT/vllm that referenced this pull request Oct 24, 2023

Add BigDL Llama worker for batching on decoding (vllm-project#4)

02b4cac

* Init * refine

shanshanpt mentioned this pull request Nov 17, 2023

Run long conetxt error : CUDA error: an illegal memory access was encountered #1700

Closed

junior-zsy mentioned this pull request Nov 20, 2023

Error with 32k Long Text in chatglm2-6b-32k Model #1725

Closed

hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024

Use FlashAttention for multi_query_kv_attention (vllm-project#4)

e7c912b

luo-cheng2021 pushed a commit to luo-cheng2021/vllm that referenced this pull request Mar 12, 2024

Merge pull request vllm-project#4 from slyalin/optimum_models

8a9862f

Support for optimum-intel models

luo-cheng2021 pushed a commit to luo-cheng2021/vllm that referenced this pull request Mar 25, 2024

Merge pull request vllm-project#4 from luo-cheng2021/luocheng/openvin…

658407a

…o-model-executor Adapt OpenVINO CPU plugin implementation

dlopes78 mentioned this pull request May 8, 2024

[Bug]: VLLM + tritonserver #4695

Closed

linxihui added a commit to linxihui/vllm that referenced this pull request May 14, 2024

Merge pull request vllm-project#4 from beagleski/bapatra/patching-for-su

7646e00

patching for having type su

Alexei-V-Ivanov-AMD mentioned this pull request May 16, 2024

[Speculative decoding][Re-take] Enable TP>1 speculative decoding #4840

Merged

afeldman-nm mentioned this pull request May 21, 2024

[Core] Cross-attention KV caching and memory-management (towards eventual encoder/decoder model support) #4837

Merged

yuhuixu1993 mentioned this pull request Jun 2, 2024

[Bug]: loading squeezellm model #5190

Closed

afeldman-nm mentioned this pull request Jun 3, 2024

[Bug]: VLLM_ATTENTION_BACKEND set to ROCM_FLASH only in GHA environment, overriding automatic backend selection; this breaks other kernel unit tests. #5208

Closed

oliver-li mentioned this pull request Jul 5, 2024

[Bug]: NCCL hangs and causes timeout #5484

Closed

haichuan1221 mentioned this pull request Jul 5, 2024

Support W4A8 quantization for vllm #5218

Merged

markmc mentioned this pull request May 21, 2025

[Bug][Failing Test]: Distributed Comm Ops - distributed/test_shm_broadcast.py #18492

Closed

1 task

zerosurplus mentioned this pull request Jun 16, 2025

[Bug]: torch.distributed.DistNetworkError: The client socket has timed out after 600000ms while trying to connect to (172.17.0.9, 46229). #19670

Open

1 task

xiaocode337317439 mentioned this pull request Jun 27, 2025

[Bug]:RuntimeError: CUDA error: an illegal memory access was encountered #20170

Open

1 task

Chris113113 mentioned this pull request Jul 10, 2025

[Bug]: [V1][gpu_model_runner.py] CUDA memory error #19415

Open

1 task

shrijayan mentioned this pull request Jul 12, 2025

vLLM hangs after 10 minutes without any error message #1492

Closed

tyxiong23 mentioned this pull request Jul 30, 2025

[Bug]: GLM-4.1V-Thinking ValueError #21811

Closed

1 task

xiaomofang mentioned this pull request Jul 31, 2025

[Bug]: There is an issue with speculative inference in Eagle mode, where the context length of vLLM inference is constrained by the draft model. #21986

Open

1 task

devops724 mentioned this pull request Aug 3, 2025

[Bug]: vLLM engine crashes then restarts and loads the model on sleep if a chat request is made #15483

Open

1 task

fernandaspets mentioned this pull request Aug 8, 2025

[Bug]: --tensor-parallel-size 2 seems broken for Blackwell 6000 pro since version 10 #22479

Open

crischeng mentioned this pull request Aug 12, 2025

[Bug]: CUDA error during nsys profile : unspecified launch failure #22746

Closed

1 task

JeffreyWong20 mentioned this pull request Aug 19, 2025

[Bug]: [TPU] profiling_tpu/profiling.py example crashed when runs on vllm_tpu docker #23194

Closed

1 task

ruisearch42 mentioned this pull request Aug 22, 2025

[Bug]: VLLM_ALL2ALL_BACKEND=naive hangs/crashes on multi nodes when serving DeepSeekV3 #23448

Open

1 task

shaamil101-etched mentioned this pull request Aug 25, 2025

[Bug]: vLLM server timeout due to multiprocessing communication error #23582

Open

1 task

ZJY0516 mentioned this pull request Aug 31, 2025

[Bug]: CUDA error when serving MiniCPM-V model #23954

Closed

zhaoyifan222 referenced this pull request in LookAround0301/vllm Sep 11, 2025

Merge pull request #4 from LookAround0301/long_seq_tmp

9c4edf4

Long seq tmp

wyn1015 mentioned this pull request Sep 19, 2025

[Bug]: assortment of warnings / errors coming out of vllm basic python inference script #18634

Open

1 task

zhanghb55 mentioned this pull request Sep 25, 2025

[Bug]: Pipeline parallel (pp>1) crashes with CUDA illegal memory access #25650

Open

1 task

dcmaddix referenced this pull request in dcmaddix/vllm Oct 6, 2025

Merge pull request #4 from dcmaddix/rebase_lora_pr

ee2468a

Update expert shapes after rebase

tina0852 mentioned this pull request Oct 11, 2025

[Bug]: Since version 0.9.2 comes with nccl built-in, using PCIE causes sys errors. How to disable nccl in vllm for versions after 0.9.2? #26607

Open

1 task

BloodAxe pushed a commit to BloodAxe/vllm that referenced this pull request Oct 17, 2025

created a new class for video sampling, revert original glm behavior (v…

31dc3e0

…llm-project#4) * created a new class for video sampling, revert original glm behavior * ruff * ruff

IwakuraRein pushed a commit to IwakuraRein/vllm that referenced this pull request Oct 21, 2025

Merge pull request vllm-project#4 from vllm-project/yewentao256-patch-2

376fba3

[Bug] Fix `libc10.so` unimported error

Michel-debug mentioned this pull request Oct 23, 2025

[Bug]: qwen3-vl-2b after ms-swift fine-tuning lance errors #27405

Closed

1 task

whwangovo mentioned this pull request Oct 23, 2025

[Bug]: vLLM (TP=8) on 235B model triggers "CUDA error: unspecified launch failure" and persistent "ERR!" state in nvidia-smi #27430

Open

1 task

markmc pushed a commit to markmc/vllm that referenced this pull request Oct 24, 2025

Merge pull request vllm-project#4 from markmc/nixl-prometheus-abstrac…

f3cc4e3

…tion [NIXL][Metrics] Add abstraction for per-connector Prometheus metrics

FragranceHUST mentioned this pull request Nov 5, 2025

[Bug]: EngineCore died unexpectedly When Inference llama(generate) #23517

Open

1 task

acodercat mentioned this pull request Nov 10, 2025

[Bugfix] Add strong reference to CUDA pluggable allocator callbacks #23477

Merged

4 tasks

access2rohit pushed a commit to access2rohit/vllm that referenced this pull request Nov 11, 2025

Merge pull request vllm-project#4 from dcmaddix/revert-3-revert-2-mar…

fb678ab

…lin_experts_mxfp4 "enable early exit for fused_moe_lora""
