Conversation

@WoosukKwon (Collaborator) commented on Mar 27, 2023

This PR implements a watermark mechanism to prevent frequent preemption.

If we admit new sequences until the GPU KV cache is completely full, preemptions are very likely to happen within the next few steps. Instead, we can reserve a small portion of the cache and refrain from using the entire cache space when admitting new sequences, which helps us avoid this inefficiency.
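
For illustration, here is a minimal sketch of how such a watermark-based admission check can work; the `BlockManager` name, method shape, and the 1% default are assumptions for this sketch, not necessarily the exact code merged in this PR:

```python
# Hypothetical sketch of a watermark-based admission check.
class BlockManager:
    def __init__(self, num_gpu_blocks: int, watermark: float = 0.01):
        self.num_free_gpu_blocks = num_gpu_blocks
        # Reserve a small fraction of the GPU KV cache blocks as headroom.
        self.watermark_blocks = int(watermark * num_gpu_blocks)

    def can_allocate(self, num_required_blocks: int) -> bool:
        # Admit a new sequence only if, after allocating its blocks, at
        # least `watermark_blocks` free blocks remain, so that running
        # sequences can keep growing without immediately triggering
        # preemption (cache eviction).
        return (self.num_free_gpu_blocks - num_required_blocks
                >= self.watermark_blocks)
```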

@WoosukKwon WoosukKwon requested a review from zhuohan123 March 28, 2023 08:16
@WoosukKwon WoosukKwon changed the title from "Add cache watermark to avoid frequent preemptions" to "Add cache watermark to avoid frequent cache eviction" on Mar 29, 2023
@WoosukKwon (Collaborator, Author) commented

@zhuohan123 I'm merging this PR since it does not conflict with any other open PR and it (slightly) improves system performance.

@WoosukKwon WoosukKwon merged commit 64e0e38 into main Mar 29, 2023
@WoosukKwon WoosukKwon deleted the watermark branch March 29, 2023 23:38
bigPYJ1151 pushed a commit to bigPYJ1151/vllm that referenced this pull request Sep 12, 2023
* add pos_encoding impl

* add benchmark and add OpenMP parallelism
xiangyuT pushed a commit to xiangyuT/vllm that referenced this pull request Oct 25, 2023
* Comments done above worker

* format

* fixed missing arguments

* fix

* format
hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024
luo-cheng2021 pushed a commit to luo-cheng2021/vllm that referenced this pull request Mar 25, 2024
ykim362 pushed a commit to ykim362/vllm that referenced this pull request Jun 17, 2024
@alixiaodi alixiaodi mentioned this pull request Aug 2, 2024
zeroorhero pushed a commit to zeroorhero/vllm that referenced this pull request Sep 23, 2024
Xaenalt pushed a commit to Xaenalt/vllm that referenced this pull request Jan 15, 2025
wuhuikx pushed a commit to wuhuikx/vllm that referenced this pull request Mar 27, 2025
### What this PR does / why we need it?
Add feature and model support matrix

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI test is enough

Signed-off-by: wangxiyuan <[email protected]>
njhill pushed a commit to njhill/vllm that referenced this pull request May 10, 2025
dcmaddix pushed a commit to dcmaddix/vllm that referenced this pull request Oct 11, 2025
Pradyun92 pushed a commit to Pradyun92/vllm that referenced this pull request Oct 17, 2025
Issue: TPU sampler and Eagle code had two separate but related issues:
1. TPU sampler divides by zero for greedy requests (temperature=0.0)
2. Eagle code triggers mypy type errors due to missing None check

Root Cause:
- TPU sampler's apply_temperature() method lacks epsilon guard to prevent
  division by zero when temperature=0.0 (greedy sampling)
- Eagle's compute_probs_and_sample_next_token() uses temperature without
  asserting it's not None, causing mypy type errors

Impact:
- TPU: Division by zero produces NaN/Inf logits, breaking speculative
  decoding on TPU platforms for all models using Eagle/rejection sampling
- Eagle: mypy type checking failures prevent pre-commit hooks from passing

Fix:
1. TPU Sampler (vllm/v1/sample/tpu/sampler.py):
   - Add all_random parameter to apply_temperature() method
   - Add epsilon guard: if not all_random: temp = torch.where(temp < _SAMPLING_EPS, 1.0, temp)
   - Update call site to pass sampling_metadata.all_random

2. TPU Metadata (vllm/v1/sample/tpu/metadata.py):
   - Add all_random property to TPUSupportedSamplingMetadata
   - Populate all_random from input_batch in from_input_batch()

3. Eagle (vllm/v1/spec_decode/eagle.py):
   - Add assert sampling_metadata.temperature is not None after all_greedy early return
   - Matches sampler.py pattern (line 162) for type safety

Files Modified:
- vllm/v1/sample/tpu/sampler.py: Epsilon guard in apply_temperature()
- vllm/v1/sample/tpu/metadata.py: Added all_random property
- vllm/v1/spec_decode/eagle.py: Added temperature None assertion
- CLAUDE.md: Updated modification vllm-project#11 to document fixes

This addresses PR vllm-project#27077 reviewer feedback and resolves mypy type errors.

Signed-off-by: Pradyun Ramadorai <[email protected]>
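
For reference, a minimal sketch of the epsilon guard described above, assuming a per-request `temp` tensor of shape `[batch]` and logits of shape `[batch, vocab]`; the `_SAMPLING_EPS` value and function shape are assumptions for this sketch, not verbatim vLLM code:

```python
import torch

_SAMPLING_EPS = 1e-5  # assumed threshold below which temperature means greedy


def apply_temperature(logits: torch.Tensor, temp: torch.Tensor,
                      all_random: bool) -> torch.Tensor:
    # Dividing by temperature == 0.0 (greedy requests) yields NaN/Inf.
    # Replace near-zero temperatures with 1.0; greedy requests take
    # argmax later, so the substituted value never affects their output.
    if not all_random:
        temp = torch.where(temp < _SAMPLING_EPS, torch.ones_like(temp), temp)
    return logits / temp.unsqueeze(dim=1)
```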