[Bugfix] Fix Hybrid KV cache hit length computation for eagle #33270
xyang16 wants to merge 3 commits into vllm-project:main
Conversation
Signed-off-by: Xin Yang <xyangx@amazon.com>
Code Review
This pull request addresses a performance regression caused by an infinite loop in cache hit computation for hybrid models using Eagle speculative decoding. The fix involves breaking the iterative refinement loop after one pass when Eagle is enabled, which correctly resolves the issue for hybrid attention models. My review focuses on the potential for silent correctness issues when Mamba layers are also present, as the Eagle-specific logic is not applied consistently across all layer types.
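As an illustration only (a toy sketch with hypothetical names, not vLLM's actual code), the "break after one pass" idea can be modeled like this: if the eagle path drops the last matched block on every lookup, re-running the downward-closed clamp shrinks the hit by one block per pass, so stopping after a single pass is what keeps the hit length stable.

```python
def refine_hit_blocks(hits_by_group, block_size, use_eagle):
    """Toy model of downward-closed cache-hit refinement.

    Every KV-cache group's hit must be a prefix of the shortest group's
    hit. With eagle, each lookup drops the last matched block (eagle
    needs that block recomputed), so without the early break each pass
    shrinks the hit by one more block until it reaches 0.
    """
    while True:
        shortest = min(len(b) for b in hits_by_group)
        if use_eagle and shortest > 0:
            shortest -= 1  # eagle: last matched block is not usable
        if all(len(b) == shortest for b in hits_by_group):
            break
        for b in hits_by_group:
            del b[shortest:]  # clamp to the common prefix
        if use_eagle:
            break  # the fix: one pass only, otherwise shortest -> 0
    return shortest * block_size


# Hypothetical numbers: two groups with 5 and 3 cached blocks, block_size 16.
print(refine_hit_blocks([[1] * 5, [1] * 3], 16, use_eagle=False))  # 48
print(refine_hit_blocks([[1] * 5, [1] * 3], 16, use_eagle=True))   # 32
```

This only sketches the loop structure the review summary describes; the real computation in `find_longest_cache_hit` involves block hashes and per-group specs.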
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Hi @xyang16, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
Signed-off-by: Xin Yang <xyangx@amazon.com>
Force-pushed from 3219b8d to 63e16e4.
```diff
  # (downward-closed property)
  cached_blocks = hit_blocks_by_group[group_ids[0]]
- if is_full_attn and cached_blocks is not None:
+ if (is_full_attn or self.use_eagle) and cached_blocks is not None:
```
What about proceeding with this? https://vllm-dev.slack.com/archives/C07RFT2UT16/p1769063932776689?thread_ts=1769013739.827459&cid=C07RFT2UT16
Thanks for review! I will try this.
I just tried the approach, and it still hits the same problem: `curr_hit_length` is reduced by `block_size` on every while-loop iteration until it reaches 0. Also, `curr_hit_length` no longer matches the length of `hit_blocks`.
```python
if self.use_eagle:
    hit_blocks = manager_cls.find_longest_cache_hit(
        block_hashes=_get_block_hashes(spec),
        max_length=curr_hit_length + spec.block_size,
        kv_cache_group_ids=group_ids,
        block_pool=self.block_pool,
        kv_cache_spec=spec,
        use_eagle=self.use_eagle,
        alignment_tokens=self.lcm_block_size,
    )
    curr_hit_length = max(
        0, len(hit_blocks[0]) * spec.block_size - spec.block_size
    )
```
Please let me know what you think. Thanks!
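For concreteness, a worked example (hypothetical numbers) of the mismatch described above: the update formula keeps `curr_hit_length` one full block short of the tokens actually covered by `hit_blocks`.

```python
block_size = 16
hit_blocks = [["b0", "b1", "b2", "b3"]]  # one group with 4 cached blocks

# The update from the snippet above: subtract one block from the hit.
curr_hit_length = max(0, len(hit_blocks[0]) * block_size - block_size)

print(curr_hit_length)                   # 48
print(len(hit_blocks[0]) * block_size)   # 64 tokens actually covered
```

So after each pass, `curr_hit_length` no longer equals `len(hit_blocks[0]) * block_size`, which is the inconsistency noted in the comment.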
This pull request has merge conflicts that must be resolved before it can be merged.
Purpose
We observed a 20% performance regression for gpt-oss with eagle in the 0.14.0 release. We found it was caused by a 0% prefix cache hit rate.
`curr_hit_length` is reduced by `block_size`, and `curr_hit_length < hit_length` is always true here, until `hit_length` reaches 0. So the returned `hit_blocks` becomes empty and the returned `hit_length` becomes 0. This makes the prefix cache hit rate always 0%.

Test Plan
Test Result
Unit tests passed.
Benchmarking
Main:
PR:
Output token throughput improves 20%.
Accuracy Testing
Main:
PR: