Add cache watermark to avoid frequent cache eviction #11
Merged
Conversation
Collaborator (Author)
@zhuohan123 I'm merging this PR as it does not conflict with any other open PR and it (slightly) improves system performance.
This PR implements a watermark mechanism to prevent frequent preemption.

If we admit new sequences until the GPU KV cache is completely full, preemptions become highly likely within the next few steps, because the already-running sequences still need fresh cache blocks as they generate tokens. Instead, we can reserve a small portion of the cache as a watermark and refrain from using the entire cache space when admitting new sequences. This helps us avoid the inefficiency of repeated preemption and re-admission.
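For illustration, here is a minimal Python sketch of how such a watermark-based admission check might work. This is not the PR's actual diff; the class and method names (`KVCacheAdmissionPolicy`, `can_admit`) and the 1% default watermark are assumptions chosen for the example:

```python
class KVCacheAdmissionPolicy:
    """Admits new sequences only while free GPU cache blocks stay above a watermark.

    Illustrative sketch only; names and defaults are hypothetical, not vLLM's code.
    """

    def __init__(self, num_gpu_blocks: int, watermark: float = 0.01) -> None:
        # Reserve a small fraction of the cache (here 1% by default) as headroom.
        self.num_gpu_blocks = num_gpu_blocks
        self.watermark_blocks = int(watermark * num_gpu_blocks)
        self.num_free_blocks = num_gpu_blocks

    def can_admit(self, num_required_blocks: int) -> bool:
        # Admit a new sequence only if, after allocating its blocks, the number
        # of free blocks stays at or above the watermark. The reserved headroom
        # lets running sequences grow without immediately forcing a preemption.
        return self.num_free_blocks - num_required_blocks >= self.watermark_blocks

    def allocate(self, num_required_blocks: int) -> None:
        # Caller should check can_admit() first; allocation consumes free blocks.
        assert self.can_admit(num_required_blocks)
        self.num_free_blocks -= num_required_blocks


if __name__ == "__main__":
    policy = KVCacheAdmissionPolicy(num_gpu_blocks=1000, watermark=0.01)
    print(policy.can_admit(995))  # False: would leave only 5 free blocks, below the 10-block watermark
    print(policy.can_admit(990))  # True: leaves exactly 10 free blocks, at the watermark
```

The trade-off is that the reserved headroom slightly reduces peak cache utilization, but in exchange sequences that are already running can keep appending tokens without triggering the eviction-and-recompute cycle the PR title refers to.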