Implement `single_query_cached_kv_attention` kernel #3

WoosukKwon · 2023-03-01T21:40:49Z

This PR adds the single_query_cached_kv_attention kernel.

Supported data types:

half
float

Tested models:

OPT-125M
OPT-350M
OPT-1.3B
OPT-2.7B
OPT-6.7B
OPT-13B

Tested GPUs:

A100

Merge linear layers

* add multiple requests test * fix

Passing alibi_slopes and sliding_window to PagedAttention extension

Added dockerfile with vLLM + openvino

Geon dev

* Remove assertion * adapting jamba vllm to changes after hf release, working on weight loading in modeling file * splitting the JambaDecoderLayer to JambaMambaDecoderLayer and JambaAttentionDecoderLayer * weight loading from hf checkpoint supposedly works, might be a mixup in the MoE between the gated and non-gated weights * Add mamba from jamba modeling file * Remove slow forward * Modifications to mamba_mixer * Save changes, WIP * Fix cache placement * Debugging * Additions and logging * Jamba with mamba cache handling * Clean up * Another cleanup * Use vllm's RMSNorm instead of JambaRMSNorm, Thier implementation is with fused kernel * Clean up and orginization of the objects to handle the mamba cache * Shorten the code for kv cache mem * Move cache handling inside the Mixer * Add mamba to the wheel requirements * Add mamba to the requirements script * Add mamba_metadata * Add to __init__ __all__ * Revert 2 commits ad1a3db 'Add mamba to the requirements script' 75ed2c8 'Add mamba to the wheel requirements' * Clean up * Naming * Apply whitespace suggestions from code review * pass tie_word_embeddings to PretrainedConfig init * Replace repeat with expand as expand doesn't require more mem * Allocate really small cache if needed , don't use meta * Fix for expanded --------- Co-authored-by: Mor Zusman <[email protected]> Co-authored-by: Erez Schwartz <[email protected]> Co-authored-by: tomeras91 <[email protected]>

fix: merge error by accident

epd clean code

ROOT CAUSE: draft_q_soft_temp=0.50 was SHARPENING the distribution instead of softening it (dividing by tau<1.0 doubles logit magnitudes). This caused nucleus to collapse to 1-2 survivors → q≈1.0 → acceptance stuck at ~0.7038 (average p_target). FIXES: 1. Config defaults (config.py, arg_utils.py): - draft_q_temp_offset: 0.15 → 0.25 (better dynamic range) - draft_q_soft_temp: 0.50 → 2.0 (SOFTENS instead of sharpens) At draft_temp=0.05: - Before: tau_q = max(0.05+0.15, 0.50) = 0.50 (2x sharper!) - After: tau_q = max(0.05+0.25, 2.0) = 2.0 (2x softer) 2. Force min_keep=2 in nucleus (eagle.py line 271): - Added keep_sorted[..., :2] = True - Prevents survivors=1 by construction (defensive programming) 3. Fix smoothing to uniform over kept set (eagle.py lines 275-287): - Before: Mixed with untempered baseline (wrong approach) - After: Uniform distribution over survivors only (correct) - Prevents q from reaching exactly 1.0 in corner cases 4. Remove dead code (eagle.py line 322): - Deleted unused self._current_sampling_metadata assignment - No longer needed with draft-anchored approach (bug #2 fix) Expected results: - tau_q ≥ 2.0 at ultracold temps → softer distribution - NUC_DEBUG: survivors = hundreds/thousands (not 1-2) - Q_DEBUG: q ∈ [0.5, 0.8] (not 0.98-1.0) - Accept rate: dynamic range restored across temp sweep

[Feature] support multi-requests

Bugfix/fps

Enhanced documentation for plugin patches: 1. Patch vllm-project#1 (Usage Tracking Helper): - Clarified as OPTIONAL (has fallback in harmony streaming patch) - Changed from "REQUIRED" to "OPTIONAL" - Explained fallback mechanism in patched_stream_method.py - Marked as upstreamable (minor utility addition) 2. Patch vllm-project#3 (Harmony Token-by-Token Streaming): - Added detailed speculative decoding context - Explained Eagle draft model generates 5-10 tokens per step - Documented specific failures with batch processing: * Tool calling broken * Multi-channel content lost * Token truncation during channel transitions - Added before/after code examples - Linked to PR vllm-project#26291 (Eagle3 Multi-Channel Streaming Fix) - Documented upstream status and removal plan Key insight: This patch exists because Eagle speculative decoding returns multiple tokens per step, and upstream's batch processing can't handle per-token channel switching. Signed-off-by: Pradyun Ramadorai <[email protected]>

[Bug] Fix Einsum in DeepGEMM tests

@wuhang2014

* # This is a combination of 6 commits. # This is the 1st commit message: mooncake store connector Signed-off-by: CHEN <[email protected]> # This is the commit message vllm-project#2: mooncake store connector Signed-off-by: CHEN <[email protected]> # This is the commit message vllm-project#3: mooncake store connector Signed-off-by: CHEN <[email protected]> # This is the commit message vllm-project#4: mooncake store connector Signed-off-by: CHEN <[email protected]> # This is the commit message vllm-project#5: mooncake store connector Signed-off-by: CHEN <[email protected]> # This is the commit message vllm-project#6: mooncake store connector Signed-off-by: CHEN <[email protected]> * mooncake store connector Signed-off-by: CHEN <[email protected]> * mooncake store connector Signed-off-by: CHEN <[email protected]> mooncake store connector Signed-off-by: CHEN <[email protected]> mooncake store connector Signed-off-by: CHEN <[email protected]> mooncake store connector Signed-off-by: CHEN <[email protected]> mooncake store connector Signed-off-by: CHEN <[email protected]> mooncake store connector Signed-off-by: CHEN <[email protected]> mooncake store connector Signed-off-by: CHEN <[email protected]> fix comments * Update vllm/distributed/ec_transfer/utils/tensor_memory_pool.py Co-authored-by: Copilot <[email protected]> * Update vllm/distributed/ec_transfer/ec_lookup_buffer/mooncake_store.py Co-authored-by: Copilot <[email protected]> * Update vllm/distributed/ec_transfer/ec_connector/mooncake_storage_connector.py Co-authored-by: Copilot <[email protected]> * Apply suggestion from @wuhang2014 line length format * Apply suggestion from @wuhang2014 remove extra empty line --------- Signed-off-by: CHEN <[email protected]> Co-authored-by: wuhang <[email protected]> Co-authored-by: Copilot <[email protected]>

…ts_mxfp4 Revert "enable early exit for fused_moe_lora"

WoosukKwon added 14 commits February 28, 2023 12:06

[Bugfix] int -> torch.int

6f5c391

[WIP] Add attention kernel

5726be8

Fix a bug in loading QK

1202c0a

Raise an error for invalid block sizes

b5f3c49

Add FP32 flags

645a49e

Change value cache layout

ad13de6

Add single_query_cached_kv_attention

3b416c0

Add TODO

723c9fc

Move

78e048a

Minor

b7c7a60

Minor

a0fbd6b

Minor

c89e5d3

Add a test for attention ops

7d30bdc

Minor fix in comment

40a2f7b

WoosukKwon merged commit 0deacbc into main Mar 1, 2023

WoosukKwon deleted the attention-kernel branch March 1, 2023 23:11

murongweibo mentioned this pull request Jul 11, 2023

NCCL Error 5: invalid usage #427

Closed

TheBloke mentioned this pull request Jul 20, 2023

Can't launch OpenAI API server on newly installed vLLM in Docker - fastchat not found #537

Closed

v1nc3nt27 pushed a commit to v1nc3nt27/vllm that referenced this pull request Sep 12, 2023

Merge pull request vllm-project#3 from ri938/merge_linear_layers

a3ac858

Merge linear layers

xiangyuT pushed a commit to xiangyuT/vllm that referenced this pull request Oct 18, 2023

Send multiple requests to underlying model (vllm-project#3)

e36fc39

* add multiple requests test * fix

shanshanpt mentioned this pull request Nov 17, 2023

Run long conetxt error : CUDA error: an illegal memory access was encountered #1700

Closed

junior-zsy mentioned this pull request Nov 20, 2023

Error with 32k Long Text in chatglm2-6b-32k Model #1725

Closed

hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024

Implement single_query_cached_kv_attention kernel (vllm-project#3)

98a4895

Qubitium mentioned this pull request Mar 5, 2024

Generation with Prefix-cache are slower than the ones without it ? #3154

Closed

luo-cheng2021 pushed a commit to luo-cheng2021/vllm that referenced this pull request Mar 12, 2024

Merge pull request vllm-project#3 from slyalin/window_and_alibi

ce80498

Passing alibi_slopes and sliding_window to PagedAttention extension

luo-cheng2021 pushed a commit to luo-cheng2021/vllm that referenced this pull request Mar 20, 2024

Merge pull request vllm-project#3 from ilya-lavrenov/docker-file

354ca31

Added dockerfile with vLLM + openvino

daniel-geon-park added a commit to gmlwns2000/vllm-timber that referenced this pull request Apr 15, 2024

Merge pull request vllm-project#3 from DeepAuto-AI/geon-dev

52ed876

Geon dev

sfc-gh-hazhang referenced this pull request in sfc-gh-hazhang/vllm May 7, 2024

Merge pull request #3 from Snowflake-Labs/arctic-linting

bfe5011

zerosurplus mentioned this pull request Jun 16, 2025

[Bug]: torch.distributed.DistNetworkError: The client socket has timed out after 600000ms while trying to connect to (172.17.0.9, 46229). #19670

Open

1 task

xiaocode337317439 mentioned this pull request Jun 27, 2025

[Bug]:RuntimeError: CUDA error: an illegal memory access was encountered #20170

Open

1 task

Chris113113 mentioned this pull request Jul 10, 2025

[Bug]: [V1][gpu_model_runner.py] CUDA memory error #19415

Open

1 task

shrijayan mentioned this pull request Jul 12, 2025

vLLM hangs after 10 minutes without any error message #1492

Closed

tyxiong23 mentioned this pull request Jul 30, 2025

[Bug]: GLM-4.1V-Thinking ValueError #21811

Closed

1 task

xiaomofang mentioned this pull request Jul 31, 2025

[Bug]: There is an issue with speculative inference in Eagle mode, where the context length of vLLM inference is constrained by the draft model. #21986

Open

1 task

devops724 mentioned this pull request Aug 3, 2025

[Bug]: vLLM engine crashes then restarts and loads the model on sleep if a chat request is made #15483

Open

1 task

fernandaspets mentioned this pull request Aug 8, 2025

[Bug]: --tensor-parallel-size 2 seems broken for Blackwell 6000 pro since version 10 #22479

Open

crischeng mentioned this pull request Aug 12, 2025

[Bug]: CUDA error during nsys profile : unspecified launch failure #22746

Closed

1 task

bbartels pushed a commit to bbartels/vllm that referenced this pull request Aug 14, 2025

Merge pull request vllm-project#3 from hcyezhang/main

e0af39b

fix: merge error by accident

JeffreyWong20 mentioned this pull request Aug 19, 2025

[Bug]: [TPU] profiling_tpu/profiling.py example crashed when runs on vllm_tpu docker #23194

Closed

1 task

ruisearch42 mentioned this pull request Aug 22, 2025

[Bug]: VLLM_ALL2ALL_BACKEND=naive hangs/crashes on multi nodes when serving DeepSeekV3 #23448

Open

1 task

shaamil101-etched mentioned this pull request Aug 25, 2025

[Bug]: vLLM server timeout due to multiprocessing communication error #23582

Open

1 task

ZJY0516 mentioned this pull request Aug 31, 2025

[Bug]: CUDA error when serving MiniCPM-V model #23954

Closed

wyn1015 mentioned this pull request Sep 19, 2025

[Bug]: assortment of warnings / errors coming out of vllm basic python inference script #18634

Open

1 task

Bounty-hunter pushed a commit to Bounty-hunter/vllm that referenced this pull request Sep 23, 2025

Merge pull request vllm-project#3 from wuhang2014/why_epd_v_0_9_1

ecbc639

epd clean code

zhanghb55 mentioned this pull request Sep 25, 2025

[Bug]: Pipeline parallel (pp>1) crashes with CUDA illegal memory access #25650

Open

1 task

tina0852 mentioned this pull request Oct 11, 2025

[Bug]: Since version 0.9.2 comes with nccl built-in, using PCIE causes sys errors. How to disable nccl in vllm for versions after 0.9.2? #26607

Open

1 task

zhangsicheng5 pushed a commit to zhangsicheng5/vllm that referenced this pull request Oct 15, 2025

Merge pull request vllm-project#3 from pisceskkk/long_seq_dev

9f8290e

[Feature] support multi-requests

BloodAxe pushed a commit to BloodAxe/vllm that referenced this pull request Oct 17, 2025

Merge pull request vllm-project#3 from tomeras91/bugfix/fps

9bc40f4

Bugfix/fps

IwakuraRein pushed a commit to IwakuraRein/vllm that referenced this pull request Oct 21, 2025

Merge pull request vllm-project#3 from vllm-project/yewentao256-patch-1

6150ab4

[Bug] Fix Einsum in DeepGEMM tests

Michel-debug mentioned this pull request Oct 23, 2025

[Bug]: qwen3-vl-2b after ms-swift fine-tuning lance errors #27405

Closed

1 task

Flink-ddd mentioned this pull request Oct 23, 2025

Fix(llm): Abort orphaned requests when llm.chat() batch fails Fixes #26081 #27420

Merged

whwangovo mentioned this pull request Oct 23, 2025

[Bug]: vLLM (TP=8) on 235B model triggers "CUDA error: unspecified launch failure" and persistent "ERR!" state in nvidia-smi #27430

Open

1 task

FragranceHUST mentioned this pull request Nov 5, 2025

[Bug]: EngineCore died unexpectedly When Inference llama(generate) #23517

Open

1 task

acodercat mentioned this pull request Nov 10, 2025

[Bugfix] Add strong reference to CUDA pluggable allocator callbacks #23477

Merged

4 tasks

access2rohit pushed a commit to access2rohit/vllm that referenced this pull request Nov 11, 2025

Merge pull request vllm-project#3 from dcmaddix/revert-2-marlin_exper…

a324975

…ts_mxfp4 Revert "enable early exit for fused_moe_lora"

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Implement `single_query_cached_kv_attention` kernel #3

Implement `single_query_cached_kv_attention` kernel #3

Uh oh!

WoosukKwon commented Mar 1, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Implement single_query_cached_kv_attention kernel #3

Implement single_query_cached_kv_attention kernel #3

Uh oh!

Conversation

WoosukKwon commented Mar 1, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Implement `single_query_cached_kv_attention` kernel #3

Implement `single_query_cached_kv_attention` kernel #3