[Fix] prefix cache hit rate == 0 bug with gpt-oss style models #33524
heheda12345 merged 3 commits into vllm-project:main
Conversation
Code Review
This pull request fixes a prefix cache hit rate bug affecting gpt-oss style models when EAGLE speculative decoding is enabled. The core change in vllm/v1/core/kv_cache_coordinator.py identifies these simple hybrid models and bypasses the iterative convergence loop for the cache hit length, which was incorrectly applying the EAGLE block-dropping logic multiple times. This is a targeted and effective fix. The accompanying changes in tests/v1/core/test_prefix_caching.py are substantial, refactoring existing tests for better structure and adding thorough new test cases for the EAGLE-enabled hybrid model scenario. The overall implementation is sound and well tested.
heheda12345 left a comment:
LGTM! Thank you very much.
Purpose
With the same purpose as PR #33270, this PR is another simple workaround for issue #32802.
This PR detects GPT-oss style models, which consist of one full-attention group and one SWA group, and handles them as a special case in which the convergence-check while loop is unnecessary. This addresses the EAGLE spiral block drop bug, and also improves efficiency slightly, since the while loop is not needed for such simple hybrid models anyway.
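The special case described above can be sketched roughly as follows. This is a hypothetical illustration, not the code from this PR: the function names (`is_simple_hybrid`, `longest_cache_hit`) and group-type labels are assumptions modeled loosely on vLLM's v1 KV-cache coordinator.

```python
# Hypothetical sketch of the special-case check: for a "simple hybrid"
# model (exactly one full-attention group and one sliding-window group),
# the agreed cache hit length is just the minimum over groups, so the
# iterative convergence loop -- and the repeated EAGLE last-block-drop
# it caused -- can be skipped entirely.

def is_simple_hybrid(group_types: list[str]) -> bool:
    """True for gpt-oss style models: one full-attn + one SWA group."""
    return sorted(group_types) == ["full_attention", "sliding_window"]

def longest_cache_hit(group_types: list[str], hit_lengths: list[int]) -> int:
    """Return the prefix-cache hit length agreed on by all groups."""
    if is_simple_hybrid(group_types):
        # Single pass: take the shortest hit; no while loop, so the
        # caller applies the EAGLE block-drop adjustment exactly once.
        return min(hit_lengths)
    # General case (omitted): iterate, re-querying each group with the
    # shrunken bound, until every group agrees on the same hit length.
    raise NotImplementedError("general convergence loop not sketched here")
```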
However, it is worth noting that for more complicated models with multiple attention groups, this PR does not fully address the EAGLE spiral block drop issue either. A general fix cannot simply cache the hit_blocks list returned by each attention type, because SWA and Mamba-style attention do not satisfy the downward-closed property (a cache hit at token j does not imply a cache hit at token i for i < j). So we need more fundamental changes there.
Fortunately, we don't have such complex models yet, so this is not a huge issue for now.
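To make the downward-closed property concrete, here is a toy illustration (not vLLM code) of why sliding-window attention violates it: SWA at token j only needs the last W tokens, so a later token can be servable from cache even though earlier tokens' blocks were already evicted.

```python
# Toy model of sliding-window attention cache coverage. A "hit" at token j
# requires only the tokens inside the window [j - W + 1, j] to be cached,
# so a hit at a later token does not imply a hit at an earlier one.

def swa_can_serve(cached_tokens: set[int], j: int, window: int) -> bool:
    """True if SWA at token j can be served from cache (window size W)."""
    return all(t in cached_tokens
               for t in range(max(0, j - window + 1), j + 1))

# Suppose tokens 0-3 were evicted but tokens 4-9 remain cached (W = 4):
cached = set(range(4, 10))
assert swa_can_serve(cached, 9, window=4)       # hit at j = 9
assert not swa_can_serve(cached, 5, window=4)   # miss at i = 5 < 9
```

This is exactly the shape of counterexample that rules out caching per-type hit_blocks lists in a general fix.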
Test Plan
The test case is adapted from PR #33270, with the complicated cases that enable EAGLE for complex models with multiple attention groups removed.
pytest -q tests/v1/core/test_prefix_caching.py
Test Result
Passed.
Essential Elements of an Effective PR Description Checklist
Update supported_models.md and examples for a new model.