
[Fix] prefix cache hit rate == 0 bug with gpt-oss style models#33524

Merged
heheda12345 merged 3 commits into vllm-project:main from ivanium:fix-longest-prefix-cache
Feb 2, 2026

Conversation

Contributor

@ivanium ivanium commented Feb 1, 2026

Purpose

With the same purpose as PR #33270, this PR is another simple workaround for issue #32802.

This PR checks for GPT-oss style models, which consist of one full-attention group and one SWA group, and handles them as a special case in which the while loop for the convergence check is unnecessary. This addresses the EAGLE spiral block drop bug and also slightly improves efficiency, since the while loop is not needed for such simple hybrid models anyway.
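The special case can be sketched roughly as follows. This is an illustrative toy, not the actual `kv_cache_coordinator` code: the class, attribute, and function names are hypothetical stand-ins.

```python
from dataclasses import dataclass


@dataclass
class AttnGroup:
    """Hypothetical stand-in for a vLLM attention group (illustrative only)."""
    attn_type: str      # "full" or "swa"
    cached_tokens: int  # longest cached prefix this group can serve

    def longest_hit(self, num_tokens: int) -> int:
        # A real group walks its block tables; here we just clamp.
        return min(self.cached_tokens, num_tokens)


def find_longest_cache_hit(groups: list[AttnGroup], num_tokens: int) -> int:
    """Longest prefix-cache hit agreed on by all attention groups."""
    # GPT-oss style special case: exactly one full-attention group and
    # one SWA group. A single pass suffices, so the convergence loop
    # (and with it the repeated EAGLE last-block drop) is skipped.
    if len(groups) == 2 and {g.attn_type for g in groups} == {"full", "swa"}:
        return min(g.longest_hit(num_tokens) for g in groups)

    # General hybrid case: iterate until the hit length converges,
    # since shrinking the hit for one group can invalidate another's.
    hit = min(g.longest_hit(num_tokens) for g in groups)
    while True:
        new_hit = min(g.longest_hit(hit) for g in groups)
        if new_hit == hit:
            return hit
        hit = new_hit
```

With only two groups, the minimum of the two hit lengths is already a fixed point, which is why the loop (and the repeated EAGLE block drop it triggers) can be bypassed.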

However, it is worth noting that for more complicated models with multiple attention groups, this PR does not fully address the EAGLE spiral block drop issue. A general fix cannot simply cache the hit_blocks list returned by each attention type, because SWA and Mamba-style attention do not satisfy the downward-closed property (a cache hit at token j does not imply a cache hit at token i for i < j). So we need some more fundamental changes there.
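The non-downward-closed property can be seen in a toy model of SWA eviction (not vLLM code; block size and window are made-up numbers): only blocks covering the last window of tokens stay cached, so a later block can be a hit while earlier blocks are misses.

```python
# Toy illustration of why SWA cache hits are not downward-closed.
BLOCK_SIZE = 16  # tokens per KV-cache block (hypothetical)
WINDOW = 32      # sliding-window size in tokens (hypothetical)


def swa_cached_blocks(seq_len: int) -> set[int]:
    """Full blocks still cached for a seq_len-token sequence under SWA."""
    first_kept_token = max(0, seq_len - WINDOW)
    return set(range(first_kept_token // BLOCK_SIZE, seq_len // BLOCK_SIZE))


cached = swa_cached_blocks(96)  # tokens 64..95 kept -> blocks {4, 5}
# Block 5 is a hit while blocks 0..3 are misses, so a full-attention-style
# "truncate at the first miss" treatment of hit_blocks would be wrong here.
```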

Fortunately, we don't have such complex models yet, so this is not a huge issue for now.

Test Plan

The test case is adapted from PR #33270, with the complicated cases that enable EAGLE for complex models with multiple attention groups removed.

pytest -q tests/v1/core/test_prefix_caching.py

Test Result

Passed.



@mergify mergify bot added the gpt-oss (Related to GPT-OSS models), v1, and bug (Something isn't working) labels Feb 1, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a fix for a prefix cache hit rate bug affecting gpt-oss style models when EAGLE speculative decoding is enabled. The core change in vllm/v1/core/kv_cache_coordinator.py correctly identifies these simple hybrid models and bypasses the iterative convergence loop for cache hit length, which was causing incorrect multiple applications of the EAGLE block dropping logic. This is a targeted and effective fix. The accompanying changes in tests/v1/core/test_prefix_caching.py are substantial, involving refactoring existing tests for better structure and adding new, thorough test cases for the EAGLE-enabled hybrid model scenario. The overall implementation is sound and well-tested.

@dosubot

dosubot bot commented Feb 1, 2026

Related Documentation

No published documentation to review for changes on this repository.


…eagle spiral drop

Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
@ivanium ivanium force-pushed the fix-longest-prefix-cache branch from fd23877 to 47c12da Compare February 1, 2026 23:55
Collaborator

@heheda12345 heheda12345 left a comment


LGTM! Thank you very much.

@github-project-automation github-project-automation bot moved this from To Triage to Ready in gpt-oss Issues & Enhancements Feb 2, 2026
@heheda12345 heheda12345 enabled auto-merge (squash) February 2, 2026 00:12
@github-actions github-actions bot added the ready (ONLY add when PR is ready to merge/full CI is needed) label Feb 2, 2026
@heheda12345 heheda12345 merged commit a01ef3f into vllm-project:main Feb 2, 2026
42 checks passed
@ivanium ivanium deleted the fix-longest-prefix-cache branch February 2, 2026 02:00
khluu pushed a commit that referenced this pull request Feb 2, 2026
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
(cherry picked from commit a01ef3f)
PiratePai pushed a commit to PiratePai/epd_shm that referenced this pull request Feb 3, 2026
…project#33524)

Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
Signed-off-by: Pai <416932041@qq.com>
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026

Labels

bug (Something isn't working), gpt-oss (Related to GPT-OSS models), ready (ONLY add when PR is ready to merge/full CI is needed), v1

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

2 participants